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Compressed sensing is a signal processing method that acquires data directly in a compressed 
form. This allows one to make less measurements than what was considered necessary to record a 
signal, enabling faster or more precise measurement protocols in a wide range of applications. Using 
an interdisciplinary approach, we have recently proposed in |T] a strategy that allows compressed 
sensing to be performed at acquisition rates approaching to the theoretical optimal hmits. In this 
paper, we give a more thorough presentation of our approach, and introduce many new results. 
We present the probabilistic approach to reconstruction and discuss its optimality and robustness. 
We detail the derivation of the message passing algorithm for reconstruction and expectation max- 
imization learning of signal-model parameters. We further develop the asymptotic analysis of the 
corresponding phase diagrams with and without measurement noise, for different distribution of sig- 
nals, and discuss the best possible reconstruction performances regardless of the algorithm. We also 
present new efficient seeding matrices, test them on synthetic data and analyze their performance 
asymptotically. 
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I. INTRODUCTION 
A. Background on compressed sensing 

When acquiring a signal, one needs to perform as many measurements as the number of unknowns. For a continuous 
signal, for instance, this translates into the Nyquist's law: in order to sample perfectly, the sampling rate must be at 
least twice the maximum frequency present in the signal. This conventional approach underlies virtually all signal 
acquisition protocols used in physics experiments, in audio and visual electronics, in medical imaging devices and so 
on. The compressed sensing (CS) approach is triggering a major evolution in signal acquisition that goes against this 
common wisdom: According to CS, one can recover signals and images perfectly using far fewer measurements, and 
this results in a gain of time, cost, and precision. To make this possible, CS relies on the fact that many signals of 
interest contain redundancy and thus are sparse in some basis (i.e they contain many coefficients close to or equal 
to zero when represented in some domain). This is the same insight used in data compression: the pictures we 
take with our cameras can be strongly compressed in the wavelet basis (almost) without the loss of quality, and this 
idea is for instance behind the JPEG 2000 algorithm. It would thus be convenient to record signals directly in a 
compressed format (thus the origin of the name "compressed sensing" ) to save both in memory space and in number 
of measurements. The CS approach aims to design measurement protocols that acquire only the necessary information 
about the signal, in some compressed form. In a second step, one uses computational power to reconstruct the original 
signal exactly [H [3] . The inverse problem posed by this second step is in general highly non-trivial. 

Mathematically, the CS problem can be posed as follows: given an A^-component signal s, one makes M measure- 
ments that are grouped into an M-component vector y, obtained from s by a linear transformation using a AI x N 
measurement matrix F, given by = X]t=i ^t^i^i with /i = 1, 2, . . . , M . The observer has freedom in the choice of 
the measurement protocol, and he knows the results of the measure (the M values in vector y) and the M x N matrix 
F (in general various kinds of noise are present, as we shall discuss below). The aim is then to reconstruct the signal 
s from the knowledge of F and y. This amounts to inverting the linear system y = Fs. However, we want to have 
M as small as possible and when M < N there are fewer equations than unknowns. The system is under-determined 
and the inverse problem is ill-defined. CS, however, deals with sparse signals s, in the sense that only K < N oi the 
components are non-zero. In the noiseless case, an exact reconstruction is possible for such signals as soon a,s M > K, 
and this condition is also a necessary one for instance in the case where the non-zero component of the signal are 
independent identically distributed (iid) real variables, drawn from a distribution with a continuous part. This ability 
to recover signals using only a limited number of measurements is crucial in many fields ranging from experimental 
physics and image processing to astronomy or systems biology, making CS a very attractive concept. 

The most widely used technique in CS is based on a development that took place six years ago thanks to the works 
of Candes, Tao, Donoho and collaborators [2H5]: they proposed to find the vector satisfying the constraints y = Fx 
which has the smallest ii norm, defined as — X]i=i l^d- This optimization problem is convex and can be solved 

using efficient linear programming techniques. They have also suggested the use of a random measurement matrix 
F with iid entries. This is a crucial point, as it makes the M measurements random and incoherent. Incoherence 
expresses the idea that objects having a sparse representation must be spread out in the domain in which they are 
acquired, just as a Dirac function or a spike in the time domain is spread out in the frequency domain after a Fourier 
transform. These ideas have led to fast and efficient algorithms, and the ^i-reconstruction is now widely used, and 
has been at the origin of the burst of interest in CS over the last few years. It is possible to compute exactly the 
performance of the ii reconstruction in the limit A — >■ oo, and the analytic study shows the appearance of a sharp 
phase transition [6]. For any signal with density p = K/N, the £i reconstruction gives indeed the exact result x — s 
with probability one only if a = M/N > a^^ (p) where a^^ (p) is, however, larger than p. The £i reconstruction is thus 
sub-optimal: it requires more measurements than theoretically necessary. 



B. Our main results 

In this paper we analyze a probabilistic reconstruction of the signal in compressed sensing, which we have introduced 
in [1]. We provide here a more detailed presentation, and we include several new results. We use a simplification 
of the belief propagation (BP) algorithm, also known as approximate message passing (AMP) [7] or generalized 
approximate message passing (G-AMP) [5] in the context of CS. The probabilistic approach is combined with an 
expectation-maximization type of learning of parameters as in [1] (which has been independently proposed in the 
context of G-AMP in [9|). We use the replica and cavity methods to analyze on one hand the asymptotic performance 
of the BP algorithm and on the other hand the information theoretical limits for signal reconstruction, and the 
associated phase transitions. For sensing matrices with iid entries there is a region of parameters (signal sparsity. 
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undersampling rate and measurement noise) in which there is a gap between the BP reconstruction and the optimal 
reconstruction. In this hard region, BP iterations are blocked in a suboptimal fixed point. We also study in detail the 
phase diagram in the presence of measurement noise and observe that the region where BP is suboptimal persists, 
but becomes smaller and eventually disappears as the noise variance grows. Analyzing the origin of this algorithmic 
barrier and thinking about an analogy with crystal nucleation in [Jj we designed and tested BP reconstruction with 
seeded measurement matrices for which this gaps shrinks or entirely disappears. The implementation of our matrices 
and of the algorithm is available at http://aspics.krzakala.org/ 

We now describe the organization of this paper and list its main contributions: 



• Optimality of the probabilistic reconstruction We review in Sec II B the well known fact that probabilistic 
inference is optimal when the signal model matches the actual signal distribution. For good performance one 
usually requires a signal model that is "close enough" to the actual signal to be inferred. The unavailability 
of such a good signal model is often at the basis of the critics of this probabilistic - Bayesian - inference. The 
situation is much more favorable in the case of noiseless compressed sensing. We noticed, and proved, in jlj that in 
the case of noiseless CS probabilistic inference is optimal even if the signal model mismatches seriously the actual 
signal, details are in Sec. |II A[ This property makes our approach very robust. In our numerical experiments 
we successfully use the Gauss-Bernoulli model even for signals that are far from having iid Gauss-Bernoulli 
components. Despite this result, in practice it turns out to be useful to incorporate expectation-maximization 
learning of parameters of the signal model, as described in Sec. |II C[ 

• The message passing reconstruction algorithm We derive in detail the reconstruction algorithm and 
discuss how it is related to the existing ones. In section [III A| we give it in a form where the messages are being 
sent between signal components and measurement components and back - this being equivalent to the relaxed- 



BP algorithm [TOl E] . In section III B we then derive a simplified form where messages "live" only on the signal 
components and on the measurement component. This form is related to the seminal Thouless- Anderson-Palmer 
(TAP) [12j equations in spin glass theory, and is equivalent to the AMP algorithm in the context of CS ^51 113). 
For measurement matrices with iid entries further simplifications of the algorithm exist, and are useful for a 



more efficient implementation, and this is shown in Sec. IIIC We also derive the BP equations for expectation 
maximization learning of parameters in Sec. |III D| 

• Asymptotic analysis of the algorithm and of the probabilistic approach We use the cavity and replica 
methods to perform two types of asymptotic analysis. On one hand using the density evolution we describe the 
behavior of the belief propagation algorithm in the limit of large systems (Sec. 
the replica method we compute the theoretical limits for reconstruction (Sec. 



IV A ) , on the other hand using 



IV B ) , which are non-trivial in 



particular in the presence of noise and by definition do not depend on the algorithm. We derive the asymptotic 
evolution for measurement matrices having iid (or iid per block) entries of zero mean and variance \/N. The 
equations are independent of the other details of the distribution of matrix elements, and these predictions thus 
hold for many type of matrices (for instance, Gaussian or discrete binary ones). This makes our results very 



robust. In Sec. IV C we then discuss the simplifications that appear in the Bayes optimal case of matching signal 



model and signal distribution. In Sec. |IVD we derive the asymptotic evolution of the parameters in expectation 



maximization learning. Finally in Sec. IV E| we summarize all these previous equations in the case of block 
measurement matrices. 

Phase transitions, phase diagrams Using both the BP reconstruction algorithm and the asymptotic analysis 
we study the phase diagram and associated phase transitions for reconstruction of the signal. We study several 
settings: The optimal Bayesian inference when the signal model matches the signal distribution in Sec. |V A 



the case when the signal model does not match the signal distribution and the phase diagram after expectation 
maximization learning in Sec. |VB[ the phase diagram in the presence of measurement noise in Sec. |V C[ and 
the reconstruction with seeding block matrices in Sec. |VI[ Note that in doing the optimal Bayesian inference 
case, we thus study the best possible reconstruction performance, regardless of the algorithm. 

Optimality achieving measurement matrices In [ll. we introduced a new type of "seeding" measurement 
matrices with which theoretically optimal reconstruction can be obtained using the BP algorithm. Such a 
"threshold saturation" was later on proved for this type of matrices in [14 (called "spatial coupling"). In 
Sec. |VI A| we discuss again our motivation for the design of seeding matrices and show that there is relatively 
a lot of freedom in implementing the concept of seeding. We give new examples of efficient seeding matrices. 



which are actually simpler and more efficient than the one we have introduced earlier. In Sec. VI B we also 
show that these matrices are effective even when the model signal in the prior is different from the actual ones. 
In Sec. |VI C| we illustrate that one can approach the optimal reconstruction limit, even in the case of noisy 
measurements. 
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• Noise-sensitivity We discuss in detail the pliase diagram and tiie performance of tlie algorithm in the presence 



of measurement noise in Sec. V C We show that there are two regions in the phase diagram. Either the BP 
approach is optimal, i.e. it provides the same mean-squared-error as would be obtained by an intractable 
exhaustive search algorithm. Or BP is suboptimal due to an existence of a metastable state - in this case 



optimality can be restored using the seeding matrices as we show in Sec. VIC Overall this shows that the 
present approach has the best achievable noise stability. 

• Rigorous versus exact It is important to notice that the density evolution that we use for asymptotic analysis 
of BP was proven to be exact for the homogeneous matrices |15j . According to a private communication with 
the authors a proof for the block matrices also exists ^16]. Therefore, our predictions on the behavior of the 
algorithm are exact. As far as our predictions for the optimal inference are concerned, although our presentation 
here is not rigorous, the predictions are exact in the context of the series of works jTTHTn]. 

Let us define here the block measurement matrices that we use in this paper to implement the seeding concept. 
Note however, that the seeding measurement matrices do not have to be block matrices. Other implementations are 
possible. We leave for future work an investigation into the optimal design for practical situations. 

The block measurement matrices i^^i are constructed as follows: The A'^ variables are divided into Lc groups of Np, 
p — 1, . . . , Lc, variables in each group. We denote Up = Np/N. And the M measurements are divided into Lr groups 
of Mq, q — 1, . . . , Lr, measurements in each group, define aqp = Mq/Np. Then the matrix F is composed of L,. x Lc 
blocks and the matrix elements F^i are generated independently, in such a way that if /i is in group q and i in group 
p then Fp_i is a random number with zero mean and variance Jq,p/N. Thus we obtain a L^ x L^ coupling matrix Jq^p- 
For the asymptotic analysis we assume that Np — >■ oo, for all p = 1, . . . , Lc and Mq — > oo for all g = 1, . . . , L^. We 
define /(/z) (/(«)) to be the index of the block to which fj. (i) belongs, Bq is the set of indices in block q. The case 
of homogeneous matrix can easily be recovered by setting Lc = Lj. = 1. Note that not all block matrices are good 
seeding matrices, the parameters have to be set in such a way that seeding is implemented (i.e. existence of the seed 



and interaction such that the seed grows). The choice of parameters is discussed in Sec. VI 

Let us note that for both the homogeneous and the block matrices the results do not depend on the details of 
the distribution of its entries, as far as its mean and variance are fixed. In our simulations we mostly use Gaussian 
distributed random entries, or ±1/A^. The later has the advantage of taking less memory space, since we can store 
them with bits and deal with the \/N separately (memory space to store the matrix is the main limitation of our 
simulations). 

Note also that throughout the paper we use matrix entries of zero mean. Physical constraints might require the mean 
to be non-zero, but our algorithm would have to be modified for such cases. The problem, however, can be transformed 
rather easily to one of zero mean. Consider indeed the system y = Fs. Summing all M values of the vector y (and 
denoting y = il/M)J2^y^ and = il/M)j:^F^,) one finds My = E^,E^F^^x^ = E^(J2^. F^^>^ = MY.,Y,s,. 
Denote y the vector whose all M components are equal to y and F the M x N matrix where the i-th column is given 
by the values Fi. Then the system y — y = (F — F)s has a matrix with zero mean entries. 



C. Related works 



Here we discuss some interesting connections to other works on compressed sensing. It is important to realize 
that our main result, namely the joint design of an algorithm and a class of measurement matrices that lead to 
optimal CS reconstruction, and their analysis, is based on three main ingredients that were previously explored in 
the literature. These ingredients are the probabilistic approach, the use of message passing algorithm for sampling 
from the probability distribution, and the design of seeding matrices. It is only the joint use of these three ingredients 
that achieves optimal reconstruction, and the understanding of the reasons owes a lot to accumulated knowledge from 
statistical physics of disordered systems (for instance, using seeding matrices with £i reconstruction is useless, because 
£i reconstruction is not limited by a glass transition, its limitation is intrinsic to the use of the £i norm). 

• The state-of-the-art method for signal reconstruction in CS is based on the minimization of the £i norm of the 
signal under the linear constraint, for an overview of this technique see [5J|S]. A number of works also adopted 
a probabilistic or Bayesian approach f20'-'2? . Generically, one disadvantage of the probabilistic approach is that 
no exact algorithm is known for evaluation of the corresponding expectations. Whereas £i minimization is done 
exactly using linear programming. In our approach, this problem is resolved with the use of belief propagation 
that turns out to be an extremely efficient heuristic. Another issue of the Bayesian approach is the choice of the 
signal model. Whereas the performance of the £i reconstruction is independent of the signal distribution, this 
is not the case for the Bayesian approach in general. We show that actually for the noiseless CS the optimal 
exact reconstruction is possible even if the signal model does not match the signal distribution. 
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• In the noiseless case of CS it is very intuitive that exact reconstruction of the signal is in principle possible 
if and only if the number of measurements is larger than the number of non-zero component of the signal, 
a > po- In a more generic case, for instance in the presence of the measurement noise it is not straightforward 
to compute the best achievable mean-squared error in reconstruction. These theoretical optimality limits were 
analyzes rigorously in very general cases by [181 119j . These results agree with the non-rigorous replica method 
as developed for CS e.g. in [23H25] . Here we analyze the theoretically optimal reconstruction using as well the 
replica method (and explicit its connection with the density evolution). 

• The belief propagation (BP) is an inference algorithm that is exact on tree graphical model and that is a powerful 
heuristic also on loopy graphical models. It was discovered independently by several communities, in coding |26j . 
in inference [57], or in statistical physics [T^]. See [IH] for good overviews. Belief propagation was used for 
CS with sparse measurement matrices by several authors, see e.g. [22l |30l [31] . In the usual setting, however, CS 
corresponds to dense measurement matrices, hence a fully connected graphical model with continuous variables, 
the canonical form of BP iterations is intractable for such a case. However, neglecting only factors that go to 
zero in the large system size limit, the iterative equations can be written only for the means and variances of the 
corresponding probability distributions. Such a belief propagation algorithm was used in CS under the name 
relaxed BP (rBP) [Tni[II]. Again, by neglecting only o(l) terms rBP can be further simplified as shown for £i 
reconstruction in [3 [13] this version of the message passing was called approximate message passing (AMP), 
it is equivalent to the Thouless-Anderson-Palmer (TAP) [12] equations in spin glass theory. The AMP was 
further generalized (G-AMP) to the case of a general signal model in [5|[T3]. The algorithm that we use here is 
equivalent to G-AMP. We, however, provide an independent derivation. 

• The performance of the belief propagation algorithm can be analyzed analytically in the large system limit. 
This can be done either using the replica method, as in [32] , or using density evolution. An asymptotic density- 
evolution- like analysis of the AMP algorithm, called state evolution, was developed in W] , and more generally 
in [TS]. State evolution is the analog of density evolution for dense graphs. General analysis of algorithmic phase 
transitions for G-AMP was presented in [33]. In this paper we perform the same density evolution analysis for 
other variants of the problem (with learning, where the signal model does not match the signal distribution, 
with noise, etc.), without the rigorous proofs. Our main point is to analyze and understand the phase transitions 
that pose algorithmic barriers to the message passing reconstruction. 

• In cases when the signal distribution is not known, we can use expectation maximization (EM) to learn the 
parameters of the signal model [34) . EM learning with the expectation step being done with BP was done in 
e.g. [35]. In the context of CS, the EM was applied together with message passing reconstruction in [1]. An 
independent implementation along the same ideas also appeared in [3] under the name EM-GAMP algorithm 
(where EM stands for expectation-maximization) . All the predictions made in the present paper thus also apply 
to the EM-GAMP algorithm. 

• Based on our understanding of the properties of the algorithmic barrier encountered by the message passing 
reconstruction algorithm, we have design special seeded measurement matrices for which reconstruction is pos- 
sible even for close-to-optimal measurement rates. These matrices are based on the idea of spatial coupling that 
was developed first in error correcting codes [3S1|37], see [35] for more transparent understanding and results. 
Several other applications of the same idea exist, in different contexts. For an overview see [39| . 

• The use of spatial coupling was first suggested for compressed sensing in [40], where the authors observed 
an improvement over the reconstruction with homogeneous measurement matrices (see Fig. 5 in [40] ) . They, 
however, did not combine all the key ingredients to achieve reconstruction up to close to the theoretical limit 
a = Po, as we did in [T]. Their implementation of belief propagation was also not using the simplification under 
which only mean and variance of the messages are needed, hence the algorithm was not competitive speed-wise. 

We introduced seeded measurement matrices for CS in [1 , and showed there, both numerically and using 
the density evolution, that with such matrices it is possible to achieve the information theoretically optimal 
measurement rates. The design was motivated by the idea of crystal nucleation and growth in statistical 
physics. Subsequent work [14 justified this threshold saturation rigorously in the special case when the signal 
model corresponds to the signal distribution, but also more generally using the concept of Renyi information 
dimension instead of sparsity, as in [iTl 119) . Numerical experiments with seeded non-random (Gabor-type) 
matrices were also performed in [4T] . 
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II. PROBABILISTIC RECONSTRUCTION IN COMPRESSED SENSING 



The definition of the compressed sensing problem as studied in this paper is as follows 

N 

2/p = X! + ^^ = l,---,M, (1) 

2 = 1 

where Si are the signal elements, out of which only K — poN are non-zero, < po < 1- We denote by 0o the 
asymptotic empirical distribution of the non-zero elements. F^i are the elements of a known measurement matrix, 
Tjf^ are the known result of measurements, and is Gaussian white noise on the measurement with variance A^. 
We denote by a = M/N the number of measurements per variable. The goal of CS is to find an approach (i.e. 
measurement matrix and a reconstruction algorithm) that allows reconstruction with as low values of a as possible. 

In the asymptotic theoretical analysis we will be interested in the case of large signals N ^ oo, we will keep signal 
density po and measurement rate a of order one. We also want to keep the components of the signal and of the 
measurements of order one, hence we consider the elements of the measurement matrix to have mean and variance of 
order 0{1/N). 

We shall adopt a probabilistic inference approach to reconstruct the signal. The aim is to sample a vector x from 
the following probability measure 



1 1 



where Z, the partition function, is a normalization constant. Here we model the signal as stochastic with iid entries, 
the fraction of non-zero entries being p > and their distribution being (f), we restrict ourselves to functions (j){x) < co 
with finite variance. 

We stress that in general the signal properties are not known and hence (unless stated otherwise) we do not assume 
that the signal model matches the empirical signal distribution, p = pQ, A = Ag, nor (j) = (j)Q. Most previous 
approaches to reconstruction in CS can be stated in the form ([2]), e.g. the £i minimization is equivalent to p = 1 
and Laplace function (p. One crucial point in our approach is using p < 1 which includes the fact that one searches a 
sparse signal in the model of the signal. 

Eq. ^ can be seen as the Boltzmann measure on the disordered system with Hamiltonian 

iJ(x) = - 5: log [(1 - p)5(x0 + p0(xO] + ^ ^^2' '''''' , (3) 

i—l M— 1 ^ 

where the "disorder" comes from the randomness of the measurement matrix F^^i and the results y^. Stated this 
way, the problem is similar to a spin glass with A'' particles interacting with a long-range disordered potential. The 
signal X = s is a very special configuration of these particles, that we can call "planted" , which was used to generate 
the problem (i.e. the value of the vector y). In this sense all inference problems are equivalent to planted spin glass 
models. 



A. Optimality in the noiseless case 

In the noiseless case, A^ — > 0, sampling from the measure -P(x) leads to exact reconstruction as long as a > po and 
the support of the function (j) contains all the non-zero elements of the signal (i.e. an arbitrary finite function of finite 
variance supported on real numbers for general real entry signals). In particular the density and the distribution of 
the true signal does not need to be known, i.e. p ^ po and 4> ^ 4'q- This is a strong optimality property that was 
proven in the large size limit A'^ — >■ oo in [T] and that can be seen as follows. 

Define an auxiliary partition function Y[D) as the normalization of the measure -P(x) restricted to configurations 
at a mean-squared distance D from the signal s, i.e. 

N N M 

Y^{D)^ / j]d.T,n[(i-p)'5(^o+M^.)]n^^^"^^^^^^'^^'^''~'*^''' (4) 

JBn{s)tJi f=i ;:=iV27rA 

where Bd{s) is the sphere centered on s, defined by (^j^ Yld^ii^i ~ ^iY = ■ When D ^ Q and A ^ 0, the N 
dimensional integral in Q involves a product of (1 — po + oi)N Dirac delta functions. Hence as long as a > po the 
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function Ya{D) diverges as D 0, A ^ 0. This holds for every matrix F and every function cj) as long as it is 
supported on all the elements of s. 

In a second part of the optimality argument one needs to show that limA^o ^a(^)/^a(0) = whenever D > 0. 
First note that only configurations that solve all the M linear equations give a non-zero contribution to Q. Second, 
it is known that the signal s is the solution of the linear system with the largest number of zero elements [5 hence all 
the other solutions of the linear system have a negligible contribution to the integral (necessarily, a smaller number 
of Dirac delta functions remains after the integration) . 

Given this result, it then follows that for any po-dense original signal s, and any a > pa, the probability P(s) of the 
original signal goes to one when A — > 0. This result holds as long as the configuration minimizing the norm equals 
the original signal s. Remarkably this optimality holds independently of the distribution (f)Q of the original signal, 
which does not even need to be iid. Hence in the noiseless case, sampling x proportionally to the measure -P(x) gives 
the exact reconstruction in the whole region a > pQ. 



B. The Bayesian optimality and the Nishimori conditions 

The probabilistic approach can also be recovered from a Bayesian point of view. Indeed, given F and y, from Bayes 
theorem, we have 

P(x|F y) = (5) 

The value of measurements y given the knowledge of the matrix F and the signal x is, by definition of the problem, 
given by P(y|F,x) = O^^ii ^iVfj. - S^Ii Ffj-i^i) in the noiseless case, and by 

M 

= n , (6) 

with random Gaussian measurement noise of variance A^, for measurement p.. To express the probability P(x|F) we 
consider that the signal does not depend on the measurement matrix (which is true in all practical situations we are 
aware of). Further, in this paper, we do not aim to exploit possible correlations in signal entries (which could only 
improve the result of inference) and hence we model the signal as an iid: 

N 

P{^\F) ^lliil - p)S{x.) + pcPix.)] . (7) 
Thus the posterior probability of x after the measurement of y is given by 

AT M 

P(x|F, y) = — — n [(1 - p)d{x,) + p0(x.)] n , (8) 



where Z(y,F) — P(y|F) is again the normalization constant. This is nothing else than the P(x) in Eq. ([2|). 

We remind the reader that in the noiseless case, Aq = A^ = 0, we have the optimality result for an arbitrary 
signal, as described in the previous section. However, for the case with noise, if the true density of the signal, poj 
the measurement noise, Aq, and the asymptotic empirical distribution of the signal, (/>o, are not known then sampling 
from ([s]) is in general not optimal. 

However, if we assume knowledge of the true density of the signal, p — po, the measurement noise, A — Aq, and 
the asymptotic empirical distribution of the signal, (j) = then we just described the Bayes-optimal way to infer the 
signal s from the knowledge of the matrix F and the measurements y. In particular, an estimator x* that minimizes 
the mean-squared error with respect to the original signal s, defined as E = "^f^iixi — Si)^ /N, is then obtained from 
averages of Xi with respect to the probability measure P(x|F,y), i.e., 

X* = y dxi Xi i^i{xi) , (9) 

where I'iixi) is the marginal probability distribution of the variable i 



:^,ix,)= / P(x|F,y). (10) 
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In the remainder of this article we will be using this estimator. To give another example, the optimal estimator that 
minimizes the mean "absolute value" error AVE = X]t=i ~ Si\/N is given by the median of the marginal probability 

There are important identities that hold for the Bayes-optimal inference and that simplify many of the calculations 
that follow. In the physics of disordered systems these identities are known as the Nishimori conditions [HI 
Basically, the Nishimori conditions follow from the fact that the planted configuration (i.e. the original signal) is 
an equilibrium configuration with respect to the Boltzmann measure Hence many properties of the planted 

configuration can be computed without its knowledge by averaging over the distribution 

To derive the Nishimori conditions, consider the measurement matrix F fixed and for simplification let us drop the 
dependence on F from the notation. Consider a function A(x) depending on a "trial" configuration x. We define the 
"thermodynamic average" of A as 

(A(x))^y'dxA(x)P(x|y), (11) 

where P(x|y) is given by Eq. ([s]). Similarly, for a function A(xi,X2) that depends on two trial configurations Xi and 
X2, we define 

((^(xi,X2))) = J dxi J dx2A(xi,X2)P(xi|y)P(x2|y), (12) 

For a function B that depends on the measurement y and on the signal s we define the "disorder average" as 

[B{s, y)] = Jdyjds P{s) P{y\s)B{s, y) , (13) 

where the signal distribution P(s) is given by Eq. Q, and P(y|s) is the probability of a measurement y given the 
signal s, as in Eq. (|6|. Note that if B does not explicitly depend on s then we have [i?(y)] = / dyZ{y)B{y) because 
Z{y) — J dsP(s) P(y|s). Using these definitions we obtain 

[(A(x,s))] =JdyJ dsP(s)P(y|s) J dx A(x, s) P(x|y) = J dyZ(y) 1 ds 1 dx A(x, s):^^^^|^P(x|y) 

= J dyZ(y) J dxi J dx2A(xi,X2)P(x2|y)P(xi|y) = [((A(xi, X2)))] , (14) 



where in the 3rd equality we renamed variables as s = 2:2 and x — xi. Eq. ( [14^ is the general form of the Nishimori 
condition. 

We remind the reader that for many thermodynamic quantities the self- averaging property holds, i.e. for large 



system sizes the quantity (A(x, s)) converges to the average over disorder of the same quantity. Eq. (14 1 provides a 
rather general form of the Nishimori condition that holds for inference problems where the model for signal generation 
is known. 

To give specific examples, let us define m = X]i=i ^i^i/^ = s-x and g = xi •X2. Then we have in the thermodynamic 
limit [(to)] = [{q)]. Due to self-averaging we also have to = g if x, xi, and X2 were samples from the distribution 
P(x|y). Defining Q = x • x, and using the Nishimori condition, we get Q — pvaicj). 

C. Expectation maximization learning 

In general, one does not know the true density of the signal, po, the measurement noise, Aq, nor the asymptotic 
empirical distribution of the signal, (jjo (or its parameters). These parameters can be learned within the Bayesian 
approach, in a way similar to the expectation maximization algorithm [311 SH El]- Let us denote 9 as the ensemble 
of these unknown parameters. Given the matrix F and measurement vector y, the probability that the parameters 
take a given set of values 9 is 

P(0|F, y) = |M J dxP(y, x|P, 9) cc P(0|F)Z(0) , (15) 
where Z{9) is the normalization from (|8| with a given set of parameters 9 

Z{p,x,a.A)^ T[dx,T[ {l~p)S{x,) + ^^e-'-^ T[ ^^^e'^^y^-^^-^ (16) 
J tl ti L V27ra J ^^^^ V2nA 



10 



Considering that without knowing the measurements y we have no prior knowledge of 9, looking for the most probable 
value of parameters is equivalent to maximizing the partition function with respect to the parameters. Even if we do 
have a prior knowledge of 0, in the situations considered in this article the partition function scales exponentially in 
N and hence for large N and function P{6\F) independent of iV, and maximizing Z{0) is still the right thing to do. 

In what follows, in order to learn parameters 9 we will hence derive stationary equations for the partition func- 
tion Z{9) (or its logarithm). Remarkably, in many settings this leads to simple iterative equations for learning of 
parameters. 



III. THE BELIEF PROPAGATION RECONSTRUCTION ALGORITHM FOR COMPRESSED SENSING 

Exact computation of the averages (see Eq. ^) requires exponential time and is thus intractable [45]. To approx- 
imate the expectations we will use a variant of the belief propagation (BP) algorithm [551 HH HHj ■ Indeed, message 
passing has been shown very efficient in terms of both precision and speed for the CS problem. Our form of the 
message passing algorithm is closely related to the approximate message passing of [7] and is a special case of the 
generalized AMP of [3 [13]. We provide here an independent derivation of the algorithm. 



A. Belief Propagation recursion 



The canonical BP equations for the probability measure P(x|F,y), Eq. ([2]), are expressed in terms of 2MN "mes- 
sages", mj^^(xj) and mj_j^^(xj), which are probability distribution functions. They read: 



mi^^{xi) = — ^ [(1 - p)i{xi) -\- p<i){xi)\ J]^ ■m^^i{x^) , 



(17) 
(18) 



77^ 



where Z^"*'* and Z'^^ arc normalization factors ensuring that J dx^ m^_>.,;(a;i) — J dximi^fj_{xi) = 1. These coupled 
integral equations for the messages are too complicated to be of any practical use. However, in the large iV limit, 
when the matrix elements F^i scale like \/\/N, one can simplify these canonical equations. 
Using the Hubbard-Stratonovich transformation 



e 2A 



dA e 2A+ A ^ 



for Lu = (^j^i Ff^jXj) we can simplify Eq. (17) as 



1 



dAe Y[ 



dxjmj^^,{xj)e '^'^ ^ ' 



(19) 



(20) 



Now we expand the last exponential around zero because the term is small in N, we keep all terms that are of 
0{1/N). Introducing means and variances as new "messages" 



we obtain 



dAe 



Performing the Gaussian integral over A, we obtain 



n 



-lii^^f^{y^-F^,x.+^\) + -^i^^{y^-F^,x,+i\f 



(21) 
(22) 

(23) 



1 



27r 



A 



(24) 
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where the normahzation Z^^* contains all the -independent factors, and we have introduced the scalar messages: 



A 



F2. 



B 



(25) 
(26) 



The noiseless case corresponds to = 0. 

To close the equations on messages ai_j.^ and Vi^fj. we notice that 



1 

(Xi) = -^r [(1 - p)5{x^) + p(f>{xi)] e"T S^TT^M T.-,^^ 



(27) 



ai= j (ixiX^mi{xi) , 

Axi xf mi{xi) — a. 



Messages ai_>.p and Vi^f^ are respectively the mean and variance of the probability distribution mi^^{xi). It is also 
useful to define the local beliefs a,; and Vj as 

(28) 
(29) 

(30) 

(31) 

(32) 
(33) 



2 

i I 



where 

1 

m,{xi) = [{I - p)5{xi) + p(l){x^)]e-^'^-<^-'^^^''^'^-<^''^^ 

For a general function (j){xi) let us define the probability distribution 

1 



M^{^\R,x) 



Z(S2,i?) 



[{I - p)5{x) + pcj>{x)] 



V27rS 



e 



where Z{Y?,R) is a normalization. We define the average and variance of as 



fai^^,R) = J dxxM{J:'',R,x), 
fc{^^.,R) = J dxx^MiJ:\R,x)- f^i^^R), 



(where we do not write explicitly the dependence on (j)). We give an explicit form for these two functions for the 
Gauss-Bernoulli signal model, Eqs. ( 67p8 ), and for the mixture of Gaussians signal model in Appendix [Cj Notice 
that: 



/a(s^ i?) = R + ^^^ iogi(s', R) , 

^(I]^i^) = E2^^(E^i?). 



The closed form of the BP update is 

^i^fi — Ja 



E-y^^ B-r- 



fc 



1 ^75^/^ B^^i 



ai = fa 
Vi= fc\ 



1 



7 ^7^« 
^E7^7->-* E7^7->-*^ 
1 J2-y Bj^i > 



E7 ^7-i-i E7 ^ 



(34) 
(35) 

(36) 
(37) 



7 ^7->* 



For a general signal model (pixi) the functions fa and /c can be computed using a numerical integration over Xi. In 
special cases, like the case of Gaussian (jj which we use in practice, these functions are easily computed analytically and 
are given in Eqs. (67 [68). Eqs. ( 2l][22 ) together with (25 26) and (27) then lead to closed iterative message passing 
equations, which can be solved by iterations. There equations can be used for any signal s, and any matrix F. When 
a fixed point of the BP equations is reached, the elements of the original signal are estimated as x* = ai, and the 
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corresponding variance Vi can be used to quantify the correctness of this estimate. Perfect reconstruction is found 
when the messages converge to a fixed point such that = Si and Vi = 0. 

A message passing algorithm equivalent to the one that we have just described was used in [TT], where it was called 
"relaxed belief propagation" . In [TT] , it was used as an approximate algorithm for the case of a sparse measurement 
matrix F. In our case, the matrix is not sparse, and the use of mean and variances instead of the canonical BP 
messages is exact in the large N limit, thanks to the fact that the matrix is not sparse (a sum like FfiiXi contains 



of order N non-zero terms), and each element of the matrix F scales as 0{1/VN) 



B. The TAP form of the message passing algorithm 



In the message-passing form of BP described above, 2M x N messages are sent, one between each variable component 
i and each measurement, in each iteration. In fact, it is possible to rewrite the BP equations in terms of A^ -|- M 
messages instead of 2 M x A^, always within the assumption that the F matrix is not sparse, and that all its elements 
scale as 0{1/\/N). In statistical physics terms, this corresponds to the Thouless- Anderson-Palmer equations (TAP) 
|12j used in the study of spin glasses. In the large A^ limit, these are asymptotically equivalent (only o(l) terms are 
neglected) to the BP equations. Going from BP to TAP is, in the compressed sensing literature, the step to go from 
the rBP [TT] to the AMP [7] algorithm. Let us now show how to take this step. 



In the large N limit, it is clear from ( 36||37 1 that the messages ai^^ and Vi^^ are nearly independent of /i. However, 



one must be careful to keep the correcting "Onsager reaction terms" . Let us define 



i 



(38) 
(39) 



Then we have 



V - F"^ V . 



V, - Fl^v,^, 



F2. 
^ fit 



E 



V F^ ^ 



(40) 
(41) 



In order to compute = F^iUi^^, we see that when expressing a^-i.^ in terms of ai we need to keep all corrections 



that are linear in the matrix element F, 



fit 



fa 



E.s 



B„ 



Ely ^iy-*4 ~ J2i, ^ 



— A 



(42) 



Therefore 



OJ 



(43) 



The computation of is similar, this time all the corrections are negligible in the limit A^ oo. 
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Finally, we get the following closed system of iterative TAP equations that involve only matrix multiplication: 



t+i 



10 



Am + 



i^2. 

fit 



t+1 



t+l\2 Dt+1\ 



(44) 
(45) 

(46) 

(47) 

(48) 
(49) 



We see that the signal model P{xi) ~ {1 — p)6{xi)+ p(p{xi) assumed in the probabilistic approach appears only through 
the definitions ( 32p3| ) of the two functions fa and fc ■ In the case where the signal model is chosen as Gauss-Bernoulli, 
these functions are given explicitly by Eqs. ( 67p8| . Equations ( 44p9 ) are equivalent to the (generalized) approximate 
message passing of [7l [8] . 

A reasonable initialization of these equations is 



,t=o 



p dx X (f>{x) , 



(50) 

(51) 
(52) 



C. Further simplification for measurement matrices with random entries 



For some special classes of random measurement matrices F, the TAP equations 
Let us start with the case of a homogenous matrix F with iid random entries of zero mean and variance 1 /N (the 
distribution can be anything as long as the mean and variance are fixed). The simplification can be understood 
as follows. Consider for instance the quantity V^. Let us define V as the average of V"^ with respect to different 
realizations of the measurement matrix F. 

N N 

^ = E^-'' = ]^E-- (53) 

i=l i=l 

The variance is 

var V ^ - E - ^) - ^) + E (^^^ - ^) ' 

Since the average is of order one and the variance of order l/A^, in the limit of large N we can hence neglect the 
dependence on the index \x and consider all equal to their average. The same argument can be repeated for all 
the terms that contain F'^^. Hence for the homogenous matrix F with iid random entries of zero mean and variance 
l/A*", one can effectively "replace" every by 1/7V in Eqs. 



44p7| can be simplified further. 



46|[47) and (44|[45 1. The iteration equations then take 
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the simpler form (assuming for simplicity that = A) 



V = 


i 




■i 




A + 1/ 

a 


Ri = 





A + F 



(55) 

(56) 

(57) 
(58) 



/a(S2,i?,), (59) 

/c(s2,i?0- (60) 

These equations can again be solved by iteration. They only involve 2 (A/ + + 1) variables. For a general matrix F 
one iteration of the above algorithm takes 0{NM) steps (and in practice we observed that the number of iterations 
needed for convergence is basically independent of A'^). For matrices that can be computed recursively (i.e. without 
storing all their NM elements) a speed up of this algorithm is possible, as the message passing loop takes only 
0{M + N) steps. 

A second class of matrices for which a similar simplification exists is the case of the block matrices defined in 
Sec. IB For simplicity, we consider the case when the noise only depends on the block, i.e., A^ — Ag for all fi in 
block q. For the block measurement matrix with random entries of variance Jq^p/N the simplified TAP equations 
read 



E -^I'P 1^ 

p=i ieSp 



L 



1^ 1^ - A 



n 



Ri — 



X ^ ^qpJq,p 



1 



1 ^= 

M E ^^(A').P E ' 



ieBp 



, Ri 



fa 



(61) 
(62) 

(63) 

(64) 

(65) 
(66) 



where p = 1,2,... Lc, q — 1,2,... Lr- I{fJ,) (and /(«)) is defined as the index of the block to which /i (i) belongs, Bg 
is the set of indices in block q. We remind that agp — Mg/Np and Up = Np/N . 



D. Parameter learning with expectation maximization 



In our practical implementation, we use as signal model a Gauss-Bernoulli distribution. That is, the function <j){x) 
is Gaussian with mean x and variance tr^. The functions fa and fc are in this case: 



/a(S',i?) = 



(E2+cr2)2 



(l-p)e ^ +p^j^e 2(^^+-^) 

_r2 (H-x)2 (-R.--, .. . 



(67) 



^ il^-T^^^ [^2^2(22+ ^2) + (^5.2 +^^2)2J+^2g 

/,(E2,i?) = ^^±^1^ —2 . (68) 

(l-p)e 2i:2+p_^e ^^^^^ 
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See also appendix [C] where we give the form of fa and fc for the signal model consisting of mixture of Gaussians. 

The most likely values of parameters p, x, a, A can be obtained via maximizing the partition function. Within the 
belief propagation approach this is equivalent to maximizing the Bethe free entropy F{p,x,a, A) = log Z{p,x, a, A) 
expressed as [46] 



Fip, X, a, A) = ^ log Z^ + Y, log Z'^Yl log ^""^ ' 



where 



Z^= [ Y[dx,Y[m,^f,{x,) 



{l-p)dixi) + 



v/27r(A + V^) 



g 2(A + V^) 



(69) 

(70) 

(71) 
(72) 



The stationarity condition of Bethe free entropy ( 69 ) with respect to p leads to 



l/g^ + l/S- 



i-p 



T(l/o-2 + l/Sf)2 



(73) 



Stationarity with respect to x and a gives 



/9+(l-p)cr(l/f72 + l/I]2)ie + +5^ 

_(Hi/i:2 + a;/<T2)2 2 

/9+(l-p)(7(l/(T2 + l/S2)5e ^^(V^^^V^ 5^ 



(74) 



(75) 



For simplicity, we consider that the noise is homogeneous, i.e., A^ = A, for all p.. The noise level A may be unknown, 
in which case one can learn it, like the other parameters, by maximizing the free entropy. The resulting condition for 
learning of the noise variance A reads: 



A 



E 



E 



(76) 



where a;^ and are defined in Eq. ( 38 ) 



Note that instead of using the steepest gradient descent in the Bethe free energy for the mean and variance (i.e. 
Eqs. ( 74]|75 )) one can also use simpler expressions that are satisfied in the Bayes-optimal setting. In particular 



E»Q» 
Np ' 

E.(^.+ 

pN 



x\ 



(77) 
(78) 



In our numerical implementations we use these simplified conditions. In the case where the matrix F is random with 



iid elements of zero mean and variance 1/A'^, we can also use for learning the variance: EjliLi ufi/^ = ap{(T^ + cc^) 



Eqs. (73) and (74 



maximization. Eqs. (25 



75| 0^(77 
36 



26 



78) can be used for iterative learning of the parameters, in the spirit of expectation 
37 



[73} [77} [78| altogether lead to the Expectation Maximization Belief Propagation 
(EM-BP) algorithm that we have first presented in [IJ. In EM-BP one update of the BP messages is followed by an 
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update of the parameters and this is repeated till convergence (of both BP messages and the parameters). In our 
implementations we initialize the parameters as follows 

= a/10 , x*=° = , af^o = 1 • (79) 

In case the variance of the signal is not at all close to one, the sum rule ^ 17 i ^^i^i suggests a more 

sensible initialization CTj^q = ?/^/(M7VvarFp*"°). A new guess of parameters is obtained using Eqs. (|73| |77| |78| 
except if the variance becomes negative, then the new variance is set to zero, or if the new value of p becomes larger 
than a, in which case a is taken as the new value for p. To obtain an updated guess for the parameters we also 
use "damping". The updated guess is obtained as 1/2 times the old value plus 1/2 times the newly computed value. 
Empirically this speeds up the convergence and prevents some numerical instabilities. If needed, such damping is also 
used to improve convergence for the BP messages themselves. 



IV. ASYMPTOTIC ANALYSIS: STATE EVOLUTION AND REPLICAS 



Belief propagation is an efficient heuristic algorithm that is in some cases (such as the present one) amenable to 
asymptotic {N — )■ oo) analytical analysis. This statistical analysis of BP iterations is known as the "cavity method" 
(in statistical physics) [321 137] , the "density evolution" in coding [JS] , and the "state evolution" in the context of CS 
[3 115| . The corresponding equations can also be derived using the replica method, that provides an exact asymptotic 
analysis of both the BP performance and the performance of an optimal (perhaps exponentially costly) reconstruction 
algorithm. In this section we first concentrate (parts A to D) on the case of 'homogeneous' measurement matrices 
with iid entries. We derive the density evolution equations in part A, and we detail the replica approach in part B. 
Part C shows the simplifications that takes place in the Bayes-optimal case where the signal model gives the correct 
statistical properties of the underlying signal, and part D generalizes the density evolution equations to the case where 
one uses the learning procedure for the parameters of the signal model. Part E gives the density evolution equations 
in the more general case of block measurement matrices. 



A. Density evolution of the message passing 



We derive the density evolution equations in the case where the measurement matrix F has random entries that 
are iid, with mean and variance 1/N, and we assume that the parameters of the signal model are fixed. 

The density evolution (or cavity method) uses a statistical analysis of the BP messages at iteration t, in the large N 
limit, in order to derive their distribution at iteration t + 1. It turns out that these distributions are simply expressed 
in terms of two parameters: 



V 



1 V- t 

i=l 



i=l 



(80) 
(81) 



We remind the reader that are the components of the original signal s, and a*, w| are the mean and variance of the 



local beliefs defined in (28), at iteration t. V just measures the average variance of the local beliefs, and E is the 



mean-squared error achieved by BP, at a given iteration t. 



Using the definition of the quantity Ri (39) and Eq. (26), we get 



^ ^ ^fj-j ('^i 



(82) 



where is the measurement noise (as defined in (U])), a centered Gaussian variable with variance Aq. The variable 



J is a random variable with respect to the distribution of the measurement 



matrix elements F^J_i (zero mean and 1/iV variance matrix) and the noise Therefore r\ a Gaussian random variables 
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with mean and variance 



M N 



(83) 
(84) 



In the second inequaUty of (84) we neglected terms of 0{l/vN). 

Using the above results this leads us to the belief at iteration t + I, m*"'"^(xi), being distributed as 

ml+\x,) ^ —[{1 - p)S{x,) + p(f>ix^)]e- ^<^+'') (85) 

where z is a random Gaussian variable with zero mean and unit variance, and is a normalization constant. Hence 
using the definition of the BP order parameters given in (81 ) we get for a signal with iid elements 

'A + y* /£;* + Ao' 



V 



E 



^ d.s[{l~ Po)S{.s)+poMs)] J Vzf, 
*+i = / ds[{l-po)S{s)+poMs)] I -Dz 



/ A + y* , /£;* + Ao 

J a I , S + Z\ 



H 2 



(86) 



(87) 



where Vz = dze ^ /^/v^27r is a Gaussian integration measure. For the special case of a Gauss-Bernoulli signal model, 
i.e. when the function is Gaussian with mean x and variance cr^, the functions /q(5]^, R) and fd'^'^, R) are expressed 
exphcitly in Eqs . (|67||68p . 

Equations ( 86]|87 l are the density evolution equations. They describe how the mean-squared error E and the variance 
order parameter V evolve during the iterations of the BP algorithm. Note that the density evolution equations are 
the same for the message passing and for the TAP equations as indeed factors of 0{1/N) are neglected in the density 
evolution. If the messages are initialized as in ( 50]|52 ), the initial conditions of the density evolution equations are: 

-I 2 



E'^ 



•t=0 



V 



t=0 



Pqs'^ — 2ppqs / dx x(t>{x) + p'^ 



dx xd>(x) 



dx x(j){x) 



p I dx x'^4>{x) — (? 



(89) 



Fig. [T] shows several examples of this mapping for the noiseless case A = Aq = 0. We plot the evolution of the 
normalized vector (y(*+i) - - £;(*)). For a relatively high measurement density a, there is unique fixed 

point E = V = Q corresponding to an exact reconstruction of the signal. When a is below some critical point, another 
attractive fixed point i? > 0, y > appears. 



B. Replica analysis 

The density evolution presented in the previous section can also be derived independently using the replica 
. The main advantage is that the replica computations give a physical meaning to all the fixed points of 
), even to those that are not reached by iterating the BP algorithm. 
The thermodynamic properties of a disordered system given by the Hamiltonian defined in Eq. ([s]) are characterized 
by the average free entropy Ef,s,5 (log ^), where Z is the partition function defined in ([2|, s is the original signal and 
^ = {'C/ilJjLi are the measurement noise with zero mean and variance Aq for ^ = 1,2, . . . ,M. The free entropy is 
evaluated via the replica trick as 

$ - >F.s.c(logZ) = 1 hm ""--^^^"^'^ . (90) 

Introducing n replicas, we get 

ef,s,«(^")= 1 nd<n[(i-^)'^(<)+/''^(<)]nEF,s,c-^e-^^"-&^- (91) 
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Gaussian Signal, Gaussian inference, p=0.2, no spinodai 
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Binary Signai, Gaussian inference, p=0.15, no spinodai 
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Gaussian Signal, Gaussian inference, p=0.33, witli spinodai 
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FIG. 1. (color online) Examples of the BP density evolution, y-axes is the mean-squared error of the current signal estimate 
E — g — 2m + pos^, the x-axes is the average variance V = Q — q. Each arrow is a normalized vector (y - 't/ W , - S'*' ) . 

The signal model 0(a;) is Gaussian with zero mean and unit variance, the signal distribution (j)oix) is Gaussian on the top and 
{±1} on the bottom. The measurements are noiseless. On the left we show an example for relatively large measurement rate 
where there is a unique fixed point E ^ 0,V ^ 0. On the right there is another fixed point i? > 0, > which is the 
attractive one for "uninformed" initial conditions. Notice that on the top plots the line V = E is stable: this is thanks to the 
Nishimori condition when the signal is described by the correct model (po = p and (po = 4')- In that case one can work in the 
V — E sub-space. 



where a^b, . . . denote the replica indices, A is the assumed measurement noise and generally A 7^ Aq. 

In the case where the matrix F has iid elements with zero mean and variance 1 /N , we introduce the order parameters 
as follows 



= a=l,2,...,n, (92) 

i=l 

= a=l,2,...,n, (93) 



N 

i=l 
N 

I 

" N ■ 



4=1 

We use a common trick of rewriting the identity 



1= / HdQadQadmadma I ]^d(7a6dg„fce^"<3"[^Q»-|E,(-l)^l-E„<,9..(^9..-E,-^?)-E„™.(Af™.-E,-^?^ 

•'a •' a<b 
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When averaging Z", we first need to evaluate the quantity 



X„, = E 



(95) 



at fixed signal s and configuration x. In order to evaluate we first need to define ~ Y^^=i ^t^ii^i ~ ^f) + ?m 
with a = {0, 1, . . . ,n}, and where corresponds to the index of the signal: — Si. The obeys a joint Gaussian 
distribution with covariance 



1 



v^y = Ef,5 ^% - + Ao = - ^ (x° - Ao = g'' - 2m° + + Ao (96) 



Ef,^ [vlv\] = Ef,^ - - + Ao = g'^' - {m^ + m") + ps^ + Ao 



(97) 



We shall use the so-called replica symmetric (RS) Ansatz. This is consistent with using Belief Propagation, and it is 
known to be correct for the optimal Bayesian inference (i.e. when the signal model correspond to the empirical signal 
distribution) ^43^ .46). In this Ansatz the replicas are considered as equivalent, therefore: 



Going back to X^, we now have 



with 



Xf, =Ev [e^2iE:=i(<)' 



P(v) 



1 



v/(27r)"det(G) 

where (under the RS hypothesis) the covariance matrix reads 

Gaa = £;v(«) = Q + P^-2m + Ao, a - l,2,...,n. 
Gab ^ E^{vy'')=q + p^- 2m + Ao, a<b. 



Computing explicitly X^, one now finds 



X„ 



V(27r)"det(G) 



Dve 



det(l + f ) 



We now compute this determinant. We have 

G ^ {q + ps^ -2m + Ao) n +{Q - q)l , 



(98) 



(99) 



(100) 



(101) 
(102) 



(103) 



(104) 



where 11 stands for the n x n matrix with elements all equal to one. The eigenvectors of G are (a) one eigenvector 
(1, 1, . . . , 1) with an eigenvalue Q — q + n{q — 2m + ps'^ + Aq), and (b) n — 1 eigenvectors of the type (0, 0, 1, — 1, 0, . . . , 0) 
with eigenvalues Q — q. Therefore 



det(l 



A' 



A 



q + n{q — 2m + ps'^ + Aq) 



Q-q 
A 



To conclude the computation of X^ we get 



lim Xn — e 

n^O 



g-2,n + ps^ + Ao , Q-q . 



(105) 



(106) 



We thus obtain 



Ef.s,?^" = / ndOadQadm,dm,/'[]dgafcdg,be^[5 5:.Q.Q»-E 

a ah 



, qabqab-Y.a 



n 



V27fA 



(107) 



N 



dxo [(1 - Po)S{xo) + Po(I)q{xo)] Y[ dXa [(1 - p)S{Xa) + pcl^iXa)] ' 
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Let us call Y the expression in the {.} in the last equation. Introducing the following transformation into the last 
equation 

where Vz is a Gaussian integration measure with zero mean and variance one, we obtain under the RS hypothesis 
Y = Jdxo [(1 - po)S{xo) + PoM^o)] J Dz I^Jdx [(1 - p)S{x) + pcj,{x)] e-^-^+m-o+.v/?x| (io9) 

In then ^ limit, one can write that /(z)" = l + nlog/(z) and thus / DzJ{zY = 1+n / i:'zlog/(z) w J Dziog f(z) ^ 
Grouping all terms together we finally get 

Ef,s,«^" j dQdQdqdqdTOdTOe"^*('3'«'"''5,<?,™) (HO) 
where is the replica free energy function 

^{Q,q,m,Q,q,m) = A + Q-q 2 ^ ^ ~2 + y 

+ / ds[il-po)5is)+poMs)] I Vz\og[ I da;e-^-'+"-^+^^-[(l-p)<5(x) + p0(:r)]l. (Ill) 



We remind that "Dz is a Gaussian integration measure with zero mean and variance one, po is the density of the signal, 
and 0o(s) is the distribution of the signal components and = / ds s'^ (j)o(s) is its second moment, Ag is the true 
variance of the measurement noise. 

The physical meaning of the order parameters is 

^ = ^E(^?)' '^^^E^^^)'' ™ = ^E^^(^')' (112) 



in which the average is with respect to the measure P ([2|), whereas the other three m, q, Q are auxiliary parameters. 
Using the saddle point method and performing derivatives with respect to m, q, Q — q, rh, q, and Q + q we obtain 
the self-consistent equations 

A, - " - a{q - 2m + pos'^ + Aq) 

"^ = ^+'^ = aTo^' ' = — (A+Q-qr — ' ^'''^ 

m^po Jdss(j)ois) J Vz fa (^^,s + z^^ , (114) 

Q-q = [ds[{l- po)S{s) + poM^)] fvzfj^,s + z^}] , (115) 
J J \m m J 

q= [ ds [(1 - p^)5{s) + po0o(s)] / Vz fl f 4, s + z^\ . (116) 



From the definition of the order parameters (112) we obtain 

£; = g-2m + pos^, V = Q-q. (117) 

It is easily seen that the set of stationary point equations ( 113||116" ) exactly reproduces the fixed point condition of 
the density evolution equations ( 86|[87 1: BP fixed points are stationary points of the free entropy (111). 

The uniform sampling from the measure P, Eq. ([2]), is described by the global maximum of $. We can use equation 
(fill in order to confirm (non-rigorously) our previous result about the optimality of the probabilistic approach for 
any (f){x) with a support that contains that of the signal (/jq, and finite second moment. Indeed the free entropy 
evaluated close to the the signal i.e. when Q — q = m — pos"^, diverges as —{a — po) log(A + Q — q)/2. Therefore in 
the noiseless limit A — >■ 0, $ diverges when E,V 0, whenever a > po. 

It is useful to compute the free entropy restricted to configurations x at a fixed squared distance D from the signal, 
D = ~ Si)'^/N. When sampling from the probability P = P(x|F,y), in the limit of large N, the probability 
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that the reconstructed signal x is at a squared distance D = ~ /N from the original signal s is proportional 

to e''^*^-^) where $(-D) is the free entropy restricted to squared distance D. In order to compute we need to 

evaluate the following saddle point 

^D) = SPq ,^ Q .^<i>(Q, q,{Q-D + po{s^))/2, Q, g, m) , (118) 

and q = a{q — 2m + po{s'^))/{Q — g)^, and rh — Q + q. The resulting free 
entropy ^{D) is a useful quantity to visualize when the BP reconstruction fails. It will be shown and analyzed in the 
next section. 

Let us, at this point, underline the difference between distance D = J^iii^i ~ ^lY) 1^ ~ Q ^ 2rn + p^s^ and the 
mean-squared error E — '}2n{{xi) — /N = q — 2m + pqs^. Clearly D ~ E + V , and one should not confuse the two 
definitions. 



which can be done using Eqs. ([iiiliiel 



C. Analysis of Bayes-optimal inference 



So far we were discussing the general case when the signal is created using density po and empirical distribution of 
the non-zero elements and the belief propagation reconstruction algorithm is used with a signal model with density 
p ^ Po and entry distribution (f> ^ (f>o. As we explained in section (IIB) the Bayes-optimal inference corresponds to 



the case when the statistical properties of the signal, and the distribution of the measurement noise are known. Then 
one can use a signal model with 



Po ■ 



0(a;) = (l)o{x) , 



A = Ao. 



(119) 



In such a case exact sampling from the measure P ([2j) corresponds to the information-theoretic optimal way of recon- 
structing the signal. This means that the predictions obtained in this case represent the best possible reconstruction 
performances regardless of the algorithm used. 

The replica symmetric computation presented in the previous section becomes exact in this case, for reasons similar 
to those known in mean field spin glasses on the 'Nishimori line' [321 Hence in this Bayes-optimal case 
the above replica calculation can be used to study the information-theoretic limits for reconstruction in CS. This is 
equivalent to what was rigorously established by [HI [19] . 

The density evolution and the free entropy can be simplified greatly in the Bayes-optimal case, since the Nishimori 
condition (14) gives the following equalities: 



q = m , 



Q^ps^ 



E = V . 



(120) 



Hence in the Bayes-optimal case the density evolution is characterized by a single parameter, the mean-squared error 
E = ps^ — m. Note that the mean-squared distance from the sig nal to a con figuration sampled from the distribution 
P \% D = E = 2E. The density evolution equations JiipTTor (|ll3|ll6|) reduce to: 



E 



t+i 



p \ ds s ( 



A + E'^ 



(121) 



(we remind that the function fa is defined in (32)). The initial condition of Eq. (88 1 is E 



ps'^ 



p^s^. 



The free entropy also becomes a function of the single variable E: 

a{p~s^ + E) 



*nl(£^) 



a 
2 



a 



log (A -f E) 



2 "° ' ' ' 2(A + E) 
+ I ds[{l~p)5[s)+p<i>{s)] f Vzlog 



da;e^+^ 



x{s— %)-\-zx - 



[{l-p)S{x) + p<l>ix)] 



(122) 



When the signal distribution is known, the value of the MSE E at the global maximum of this free entropy provides 
the Bayes optimal reconstruction of the signal, i.e. the lowest achievable MSE given the knowledge of the measurement 
vector y and the measurement matrix F. As we will see, depending on parameters a, p and (f>{x), the BP algorithm 
where the MSE evolves according to (121 1 will either find this global maximum or it will get blocked in a local 
suboptimal maximum. 
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For completeness let us give the explicit form of the free entropy ( 122 ) for a Gauss-Bernoulli signal where 4>q has 
zero mean and unit variance: 



log (A + E) 



A + E 



/ Pzlog 



(1 - p)e 2(a + A + i3) _|_ 



2{a + A + E) 

py/ATE 



(123) 



VaTe~- 



p / Vzlog 



(1 - p)e2(A+B) + 



p^/A + E 



VA + E- 



a 



(124) 



In this case, the condition of stationarity of the free entropy, giving also the fixed-point condition of density evolution, 
takes the simple form: 



E = p 



P 



a + A + E 



Vz- 



(125) 



D. Density evolution with parameter learning 



We study here the general case where the signal is created using a density po a-nd empirical distribution of the 
non-zero elements 0o, and the belief propagation reconstruction algorithm is used with a different signal model, with 
density p ^ Po and distribution of the non-zero elements (f> (f>Q. In this case, expectation maximization can be used 
to learn the parameters, as described in Sec. |III D[ This modified BP procedure, including parameter learning, can 
also be studied with density evolution. We describe here the case that we use in our implementation, namely a model 
signal which is Gauss-Bernoulli, where (j) is Gaussian with mean x and variance cr^. The learning conditions ( 73|75 ) 
give the evolution of the parameters: 



/ ds [(1 - po)S{s) + poMs)] J P^ i_p(.)+p(t)g(s2^,+^[;) 
/ ds [(1 - po)S{s) + po<^o(g)] /Pz^(S]^ g + zU) 

g(Y.'^,s+zU) 



pW J ds [(1 ~ po)5{s) + poMs)] J P^ i_pw+p(.)g(s^s+.c/) 
2 (,+ 1) ^ Jds [(1 - po)6{s) + po0o(g)] /Pz[/,2(S ]^ . + zU) + JrXY?, s + zU)] 



pWJ ds [(1 - po)S{s) + poMs)] J T^Z i-pin+pin g^s^^s+zU) 
where the function g is defined as 



5(S2,i?) 



-g 2(i/i:^ + i/,T^) 



And we use 



U : 



Ao + Et 



The density evolution for the simplified learning (77j 78) reads 
1 



2N(t+l) ^ 



1 



ds [(1 - po)Sis) + poMs)] / Vz fai^',s + zU) , 



lit) 



ds [(1 - po)Sis) + poMs)] J Vz [/2(E^ s + zU) + f,{Y.\s + zU)] - 
The density evolution equations now provide a mapping 

obtained by complementing the previous equations on V, and E ( 86||87 1 with the learning update equations (126 



(126) 
(127) 
(128) 

(129) 

(130) 

(131) 
(132) 

(133) 
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128). In our implementation we initialize p' 



,t=o 



a/10, r 



0, and a I 



1. 



When a measurement noise is present the variance of the noise can be learned using Eq. ( 76 ) which in the density 
evolution becomes 



A(*) 



Ao + £;* 
1 + 

^ ^ AC) 



(134) 
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E. Density evolution for block matrices 



In the case of the block measurement matrices defined in Sec. I B one can easily generalize the above derivation 
of the density evolution and of the replica analysis. We just give the results here. For details of the derivation see 
appendix |Xj 

The order parameters are now 



9p 



ieB„ 



(135) 



in each block p E {1, . . . , Lc}. The free entropy analogous to that in Eq. ( 111 I becomes 
H{Qp}pLi, {qp}p=i, {mp}p^i, {Qp}p^i, UpjpLi, {™p}p=i) = 



E 



2 / ."-i"<?i 

9=1 



Qq -qq + A 



log (A + - qg) 



E 



QpQp 



p 



^rip ds [(1 - po)S{s) + po'/'o(s)] / 'Dz log <^ / d 



Ix e 



P=i \ 



mpiTip + 



[il~p)S{x) + p<f>ix)]\, (136) 



where we introduced 

La 

E 

p—i p—i 

The equations corresponding to the stationarity condition for this free entropy read: 



Pq = POS^ ^ Jqpflp, TOg = ^ dgpUpTJlp, qg 



La La 

— ^ ^ J qp^pqpj Qq — ^ ^ J qp^pQp ■ 



0=1 



(137) 



"pE 

9=1 

Lr 

m„ — n 



qp 



aqpJqpiqq " + Pg + Ag) 

{Qg -qq + A)2 



q^l Qq-Qq+A 

Qp ^ ^^p ~ Qp 



rrip^ Pa / ds s 4>o{s) / Vzfa — ,s + z — 



ds[{l - po)S{s) + poM^)] / /a 



1 








m 


nip J 








+ z— — 


nip 


nip 



(138) 

(139) 

(140) 
(141) 

(142) 

(143) 



When interpreted as a mapping (given the order parameters Qp, qp,nip at time t, one computes Qp, qp, nip form 
p^8|l40| , and then finds the new order parameters Qp,qp,nip at time t + 1 using ( |141|143| ), these equations are 
exactly the density evolution equations for the case of block matrices. These equations can be written in term of only 
2Lc order parameters, the mean-squared error Ep = qp — 2nip + p^s^ and the variance Vp = Qp — qp in each block 
p e {1, . . . , Lc}- The explicit form of the density evolution equations in terms of these 2Lc order parameters is: 



= / ds[{l-po)5{s)+poMs)] / -Dz 



fa 



- / ds [(1 - p^)5{s) + po</'o(s)] / Vzf, [^,s + z 







{-■■ 


+ z— — 


ynip 


nip 


1 


^^p\ 


— ,s + 




nip 


nip / 



(144) 
(145) 
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where: 



Too 



q=i A + X;^!! Jg^n^VV- 

L. 



^qpJqp 



(t) 



(t) 



s=l 



(146) 



(147) 



If one uses block measurement matrices together with expectation-maximization learning of the parameters, for a 
Gauss Bernoulli signal model, the density evolution equations for the parameters are: 



p(t+i) = p(t) 



^ Vz ds[il-po)Sis)+poMs)] 



,s + z- 



PoMs)] 



^(t+i) 



2\(t+l) 



T.PJ'^'J ds[(l-Po)<5(s) 

Y^fvz Us [(1 - po)^(s) + po<^o(s)] f ^ 

E / / d« [(1 - /'o)'5(s) + Po<^o(s)] - 
p=i-' J 1 



, s + z- 



(148) 



(149) 



s + 2;- 



TO. 



P \L. 
( 



P + P9(^,s + z^ 
Op 



(150) 



.s + z- 



TOo 



1 , ygp 



, s + 2;- 



TOo 



Too 



J_ _i_ V "^p 



F(t+1) 



(151) 



As in the homogeneous case, the density evolution equation of the block measurement matrices simplify in the 
optimal Bayesian approach, when the correct distribution_of the signal and its density are known po = Pi 4>o = 4>- In 
this case, the Nishimori conditions rUp = Qp and Qp = ps^ hold, hence Ep = Vp holds for every block p = 1, . . . , Lc- 
This leads to a single set of closed density evolution equations for the vector Ep, p = 1, . . . , Lc, that reads 



= I ds [(1 - p)S{s) + p4>{s)] I Vz 



fa 



1 



H 2 



TO„ 



-,s + Z- 



TO. 



EO^qp ^ qp 
Up- 



r(t) 



q=l A + X^^l^ JqrnrE\. 

In the case where (^0 is a centered Gaussian with unit variance, we get explicitly: 



E(^+^)=p--E-^ fvz 

mp + lj 



p+{l-p)e 2 ^TTLp + 1 



(152) 
(153) 

(154) 
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V. THE PHASE DIAGRAMS 



In this section we turn the equations from the previous section into phase diagrams to display the performance of 
behef propagation in CS reconstruction. We first discuss the noiseless case, with random homogeneous measurement 
matrices, this is a benchmark case that has been widely used to demonstrate the power of the £i reconstruction. 
We use measurement matrices with iid entries with zero mean and variance 1/iV (we remind that our approach is 
independent of the distribution of the iid matrix elements and depends only on their mean and variance) . Finally we 
discuss the phase diagram for noisy measurements, that present several interesting features. 



A. Noiseless measurements and the optimal Bayes case 



In Fig. [2] we show the free entropy density at fixed squared distance, for the Bayes-optimal case in which 

both (f)Q and (j) are Gaussian with zero mean and unit variance. The elements of the M x N measurement matrix F 
are independent random variables with zero mean and varia nce 1 /iV. 

The free entropy ^{D) is computed using Eq. ( [m| ) and pl8| which was derived using the replica method. The 
dynamics of the message passing algorithm (without earning) is a gradient dynamics leading to a maximum of the 
free-entropy <&(£') starting from high distance D. As expected, we see in Fig. |2] that $(£*) has a global maximum 
at D = if and only if a > po, which confirms that the Bayesian optimal inference is in principle able to reach the 
theoretical limit a = po for exact reconstruction. The left-hand side of the figure shows the existence of a critical 
measurement rate q;bp(po) > POi below which a secondary local maximum of ^{D) appears at D > 0. When this 
secondary maximum exists, the BP algorithm converges instead to it, and does not reach exact reconstruction. The 
threshold aBp(po) is obtained analytically as the smallest value of a such that $(1?) is monotonic. The behavior 
of ^{D) is typical of a first order transition. The equilibrium transition appears at a number of measurement per 
unknown a — po, which is the point where the global maximum of $(£') switches discontinuously from being at D = 
(when a > po) to a value D > 0. In this sense the value a = aBp(/9o) appears like a spinodal point: it is the point 
below which the global maximum of <&(!?) is no longer reached by the dynamics. Instead, in the regime below the 
spinodal (a < aBp(po): the dynamical evolution is attracted to a metastable non-optimal state with D > 0. 

On the right-hand side of Fig. [2] we show the evolution of the MSE as predicted by the density evolution equations, 
as well as the MSE measured using the BP algorithm for a system with size N = 15000. Below the spinodal point 
«Bp(Po) the MSE does not converge to zero, because the system is trapped in a metastable state. 
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FIG. 2. Left: The free entropy, $(-D), is plotted as a function of 7) = ( ^^^(a^i — Si)^ /N"^ for po = 0.4 and several measurement 
rates a in the Bayesian approach (when both the signal and the signal model are described by a Gauss-Bernoulli distribution). 
The evolution of the BP algorithm is basically a steepest ascent in $(-D) starting from a large value of D. Such ascent goes to the 
global maximum at D = for large value of a but is blocked in the local maximum that appears for a < aBp(po = 0.4) « 0.59. 
For a < po, the global maximum is not at D = and exact inference is impossible. Right: Using the same conditions as 
for the left figure, we show the evolution of the MSE measured experimentally during the iterations of BP for a signal of size 
A'^ = 15000 (data points) compared to the theory using density evolution (line). For the two lower measurement rates, where 
a < 0.59, the MSE saturates at a finite value. For the two higher ones it goes to zero. The full circles are for measurement 
matrices with iid Gaussian elements, the empty squares for matrices with iid ±1 elements. We see small finite size corrections, 
but otherwise there is excellent agreement between the two cases, as expected from the theory which states that only the mean 
and variance of the distribution of each matrix element matters. 



The spinodal transition is the physical reason that limits the performance of the BP algorithm. To illustrate this 
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statement, we plot in Fig. |3] the BP convergence time as a function the measurement rate a. As expected, the 
convergence time diverges around the spinodal transition asp- In the same Fig. [3] we also plot the MSE achieved by 
the BP reconstruction algorithm compared to the MSE achieved by the l\ minimization reconstruction for the same 
signal and the same measurement matrix as before. We remind that here we are still in the favorable case when the 
signal model was equal to the signal distribution p — po, 4>{x) = (f>oix)- 

Notice that the £i transition at ae-^ is continuous (second order), whereas the spinodal transition is discontinuous 
(first order). The transition at a^p is called a spinodal transition in the mean field theory of first order phase 
transitions. It is similar to the one found in the cooling of liquids which go into a super-cooled glassy state instead of 
crystallizing, and appears in the decoding of error correcting codes [IHl HE] as well. This difference might seem formal, 
but it is absolutely essential for what concerns the possibility of achieving the theoretically optimal reconstruction 
with the use of seeding measurement matrices (as discussed in the next section). 
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FIG. 3. The full (red) line (left y-axis) is the convergence time of the BP algorithm, defined as the number of iterations needed 
such that the MSE obtained by the algorithm at a given iteration does not change more than by 10"*^ in the next iteration. 
The data are obtained with the density evolution for a signal with density po = 0.4, where the non-zero elements of the signal 
are Gaussian with zero mean and unit variance. Reconstruction is done in the Bayes-optimal case. The BP convergence time 
diverges as q — > obp. The dotted lines (right y-axis) give the mean-squared error achieved by the BP algorithm (blue) and 
by the £i minimization (black) for reconstruction of the same signal. Exact reconstruction is in principle possible in the whole 
region a > po- The reconstruction with BP is exact for a > qbp(po ~ 0.4) « 0.59, whereas the i?i-reconstruction is exact only 
for a > 0.75. Note also in the regime a < qbp where BP does not reconstruct exactly the signal, the MSE achieved by BP is 
always smaller than the one of £i. 



In Fig. ID we show how the critical value asp depends on the signal density p and on the type of the signal, for 
several Gauss-Bernoulli signals. In this figure we still assume that the signal distribution is known, and hence po = P 
and 00 = We compare to the Donoho- Tanner phase transition that gives the limit for exact reconstruction 
with the £i minimization [7j 1241 149j , and to the information-theoretical limit for exact reconstruction a = p. 

Note that for some signals, e.g. the mixture of Gaussians $(2;) = [A/'(— 1,0.1) + A/'(l, 0.1)]/2, there is a region of 
signal densities (here po > 0.8) for which the BP reconstruction is possible down to the optimal subsampling rates 
a = pq. 



B. Noiseless measurements and the mismatching signal model 



In this section we show the performance of BP reconstruction and the corresponding phase diagrams in the general 
case when the density of the signal and the distribution of the non-zero signal elements is not known 



PO: 



(x) ^ (j)o{x) . 



(155) 



All the results we show are for the Gauss-Bernoulli model of the signal, i.e. (j){x) = e~(^'~^) /(^'^ ^/(-s/Sttct). As we 
argued in Sec. II A for noiseless measurements the probabilistic reconstruction for CS is optimal as long as a > po 
even if the signal model is not t he c orrect one, as in (155). This property can also be seen by analyzing the replica 

^ ~ ~ 



calculation of the free entropy (fill that close to exact reconstruction {Q 
a,s ^ —?■ -{a 



Pos^ 



Pqs^) behaves 



Po) log (Q — (7)/2. Unfortunately, in general, BP encounters a spinodal line (barrier) as in the case 
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FIG. 4. Phase diagram for the BP reconstruction in the optimal Bayesian case when the signal model is matching the empirical 
distribution of signal elements, i.e. (f>{x) = 4'q{x). The elements of the M x N measurement matrix F are iid variables with 
zero mean and variance 1/A'^. The spinodal transition qbp(po) is computed with the asymptotic replica analysis and plotted for 
the following signal distributions: (f){x) = Af{0, 1) (green), (f){x) = Af{l, 1) (blue) (j>{x) = [A/'(— 1,0.1) + A/'(l, 0.1)]/2 (magenta, 
equations needed to obtain this curve are summarized in appendix [C|. Note that for some signals, e.g. the third case, there 
is a region of signal densities (here po ^ 0.8) for which the BP reconstruction is possible down to the optimal subsampling 
rates a — po. The data are compared to the Donoho- Tanner phase transition ai-^{po) (dashed) for £i reconstruction that does 
not depend on the signal distribution, and to the theoretical limit for exact reconstruction a = po (red). The left hand side 
represents the undersampling rate a as a function of the signal density po. The right hand side shows the same data in the 
Donoho- Tanner notation, i.e. the number of nonzero elements in the signal per measurement is plotted as a function for the 
undersampling rate. 



discussed in the previous section. The position of this line (phase transition) depends on both the signal model (f>{x) 
and the signal distribution (f>oix). 

In Fig. [5] we show the phase diagram for Gauss-Bernoulli signal model, i.e. the distribution of components being 



P{x) = {1 - p)5{x) + p 



V2n 



(156) 



and various signal components distributions Poix) = (1 — po)6{x) + pQ(j)Q{x). Here we assume p = po- We see 
that the performance of BP mostly slightly decreases. For some signal distributions (e.g. the binary case (t>Q{x) — 
[5{x — 1) -|- 5{x -\- l)]/2) there is a narrow region of parameters in which the €i-reconstruction becomes better than 
the probabilistic-BP approach. 

In case the signal distribution and its sparsity are not known, the performance of BP can be improved by including 
the expectation maximization learning. We call this generalization EM-BP. In this paper we study the performance 
of EM-BP in the case where the signal model is Gauss-Bernoulli 



P{x) = {l-p)5{x) + p 



1 



crV27r 
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Expectation maximization is used to learn the three parameters p, x and a. In EM-BP we do one update of BP 
messages followed by one update of the parameters. New values of parameters are computed using Eqs 

[Pold 



78). BP message are then updated again using parameter values p 



+ min(p„ow, a)]/2, x = (xqm + ^ 



(73 
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a" = [cToIjj + max(cr^(,.jy, 0)]/2. And this is repeated till convergence. The evolution of parameters under learning is 
illustrated in the left part of Fig. [6j 

We observe that for the Gaussian-distributed signal elements (left part of Fig. [5]) the correct mean and variance are 
always learned (even in the region where exact reconstruction is not possible) . In this case the spinodal line is always 
the same as in the case when the signal distribution was known, see Fig.|4j For signals with non-Gaussian distribution 
of elements, right part of Fig. [5j the spinodal line changes slightly, the lines with learning are shown in the right part 
of Fig. [6l We conclude that EM-BP improves on pure BP and on .^i-reconstruction in many cases and hence it can 
be useful in practical situations. Of course if one has some knowledge of the signal distribution it is helpful to further 
include it in the signal model. 
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FIG. 5. Phase diagram for the reconstruction with BP when the signal model is not matching the empirical distribution 
of signal elements. The signal model is Gauss-Bernoulli with zero mean and unit variance. The measurement matrix is 
the homogeneous one with Gaussian iid entries. In this plot we assume that the signal density is known p — po. Different 
curves correspond to different distributions 00 of the signal. The dashed line gives the Donoho- Tanner transition line for £i 
reconstruction, which is independent of the signal distribution. 
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FIG. 6. Left: Learning of parameters for noiseless measurements. The signal is Gauss-Bernoulli with density po ~ 0.25, mean 
s — I and variance — = 0.5. The measurement density is a = 0.5. The EM-BP algorithm is initialized with p — 0.05, 
a; = 0, cr^ = 1. In the figure we plot the evolution of the parameters and of the mean-squared error E. The full line is the 
analytic prediction using density evolution, the data points is the EM-BP algorithm on an instance of A'^ = 12000, the full 
points are for a measurement matrix with Gaussian elements, the empty points for a matrix with elements ±1/N. Right: 
Phase diagram for the EM-BP reconstruction, that is when the signal model is not matching the empirical distribution of signal 
elements, i.e. 0(a;) 7^ (poix). Different curves correspond to different distributions (po of the signal. The dashed line gives the 
Donoho- Tanner transition line for £1 reconstruction, which is independent of the signal distribution. 



C. Phase diagram for noisy measurements 



In this section we discuss compressed sensing with noisy measurements, A > 0. We first describe the performance of 
the BP algorithm and the corresponding phase diagrams in the Bayes optimal case when the signal model corresponds 
to the signal distribution. In a second part we then discuss the general noisy case with non-matching signal model 
and learning. 

In Fig. [7]we plot the free entropy ^{E), obtained from Eq. ( |124 1, as a function of the mean-squared error E, for 
signal with nonzero elements being iid Gaussian variables with zero mean and unit variance, and a matching signal 
model. The main difference with the noiseless case, Fig. [2j is that the global maximum of the free entropy, that 
described the optimal achievable mean-squared error, is at non-zero values of the MSE. This indeed reflects the fact 
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FIG. 7. (color online) The free entropy in presence of noise as a function of the MSE. (a) p = 0.4 and A — 10~^, 

there is a first order phase transition and two local maxima do co-exist for region of subsampling rates ad > ot > as. (b) for 
larger noise, A = 10~^, there is always only one maxima, in this case the EM-BP approach is always optimal, although the 
mean-squared error may be quite large. 



that with noisy measurements exact reconstruction is no longer possible. 

Let us investigate whether BP algorithm finds a configuration with the best achievable MSE or not. Again, BP 
is basically performing steepest ascent in the free entropy starting from a large value of MSE. Depending on the 
value of the signal density p and the measurement noise variance A, we see two kinds of behavior as a function the 
subsampling rate a. For some values of p. A, see Fig. [t] (b), the global maximum of is the only maximum for 

all a, and in that case BP will converge to it. For other values of p, A, see Fig. [t] (a), the situation is similar to the 
noiseless case: 

• For a > ad the free entropy has a single maximum at a small value of MSE comparable to A. 

• For ad > a > die the free entropy has two maxima, the one at lower MSE being the global one. 

• For ac > a > as the free entropy has two maxima, the one at higher MSE being the global one. 

• For a < as the free entropy has a single maximum at a value of MSE much larger than A. 

The above result means that for a region of subsampling rates ad > a > a^ the BP algorithm is sub-optimal, as it 
converges to much higher MSE than the MSE corresponding to the optimal Bayes inference (global maximum of the 
free entropy). In the left part of Fig. |8]we plot the dependence of ad, ac, and as on the noise variance. In the right 
part we plot the MSE achieved by BP as a function of the subsampling rate. In cases where BP is suboptimal (for 
the two lowest noise variances) we compare to the optimal MSE. The data presented in Figs. [T] and |8] are obtained 
from the density evolution, i.e. N ^ oo limit of BP behavior. The behavior of BP for finite N agrees well with these 
results for systems sizes of several thousands of elements and more. 

In Fig. [9] we plot again the three phase transition lines for reconstruction with measurement noise. This time we 
plot the lines in the p-a phase diagram for several values of the variance A. As the noise increases the region of 
densities for which there is a sharp phase transition shrinks. For large enough values of A > 0.00078 there is no sharp 
phase transition for the inference of Gauss-Bernoulli signal (with matching Gauss-Bernoulli signal model) . 

Another illustration of this phase diagram with noise is in Fig. [lO] where we plot level lines following the MSE 
achieved by BP reconstruction. On the line ad the MSE of BP reconstructions increases discontinuously from values 
comparable to A to large values. 

Of course in practical applications the noise level Aq is often not known. In such cases learning of the noise level 



can be included in the EM-BP algorithm, using noise variance update Eq. (76). In Fig. 11 we illustrate the evolution 
of parameters and the mean-squared error E under such expectation maximization learning for a Gaussian signal of 
density po — 0.2, with measurement rate a — 0.5 and noise variance Aq = 10^"*. 
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FIG. 8. (color online) (a) The three phase transition lines in CS with noisy measurements for Gauss-Bernoulli signal and 
matching signal model with density p = 0.4. The blue line is the spinodal line Qs, red line is the dynamical line ad, and green is 
the critical line ac- For larger noise there is no such sharp threshold. A perfect sampling algorithm changes its behavior abruptly 
at Qc (the green line), where the quality of reconstructed signal would jump discontinuously from high MSB to low MSB. The 
BP algorithm (with the uninformed initialization) always converges to the local maxima of the free entropy corresponding to 
the largest MSB, hence its MSB jumps from a relative low value to a high value at aa (the red line). BP is hence suboptimal 
for Od > a > etc- (b) The MSB achieved by BP for several noise strengths. In the inset is the case of A = 10~* with the three 
phase transitions depicted. For Ud > a > Oc the best achievable MSB corresponds to the lower part of the curve, whereas BP 
reconstruction achieves the MSE corresponding the the upper part of the curve. Note that in this case the MSB achieved by 
£i reconstruction would be much larger (nonzero for a > 0.75 even for the noiseless case, see Fig. |3|. 




VI. SEEDING MATRICES: A WAY TO ACHIEVE OPTIMALITY 



In the previous section we exposed the reason why BP reconstruction for homogeneous measurement matrices F 
does not achieve subsampling rates down to the information theoretical limit a = po- In [1] we developed a new type 
of measurement matrices — that we coined seeding matrices — for CS for which the Umit a = pg is achievable using 
the BP reconstruction. This was built on several result in the error correcting code community |36f[5^ . Here we shall 
explain further our motivations for the construction of the seeding matrices. 

We shall give heuristic arguments why with these matrices it is possible to achieve theoretically optimal reconstruc- 
tion ande show, using the replica method (or equivalently, density evolution) that this is indeed the case. We want 
to point out that, while we use mostly the replica method/density evolution formalism, some rigorous results can 
be obtained. In particular, in the special Bayes optimal case — when the signal model corresponds to the empirical 
distribution of the nonzero signal elements — it has been now proven rigorously in |14) that the for CS with seeding 
matrices the BP reconstruction is indeed able to achieve the information theoretical limit a ~ pq. Here, we shall 
show, using the statistical physics tools, that seeding matrices allow close to optimal reconstruction also when the 
signal distribution is not known, which is even more appreciable. 
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FIG. 10. (color online) Phase diagram and level lines of the MSB for the BP algorithm in presence of noise in the Donoho- 
Tanner convention. Left: using a noise with variance A — 10"^". Right: using a noise with variance A = 10"*. The 
Donoho- Tanner transition line for £1 is shown for comparaison. 
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FIG. 11. Learning of parameters for noisy measurements. The signal is Gauss-Bernoulli of density po = 0.2, mean s = 0.5 and 
variance — = 1. The measurement rate is a = 0.5 and the noise variance Ao = 10~^. The EM-BP algorithm is initialized 
with p = 0.05, I = 0, (7^ = 1, A = 10"^°. In the fi gure we plot the evolution of the parameters and of the mean-squared error 
E for three cases. The full line is the density evolution, the data points is the EM-BP algorithm on an instance of N — 10000, 
the full points are for a measurement matrix with Gaussian elements, the empty points for a matrix with elements ±1/N. 



A. Why and when does seeding work? 

As exposed in the previous section, for homogeneous measurement matrices with iid entries, BP is able to reconstruct 
the signal correctly at a > asp, bellow aBP a metastable state (i.e. a local maximum of $(-0) at D > 0) appears in 
the measure P(x|F,y). The iterations of the BP algorithm get "trapped" in this state and BP is therefore unable to 
find the global maximum corresponding to the original signal (see Fig. [S]). This is a situation well known in physics, 
that is typical for a system undergoing a first order phase transition. A familiar example of first order phase transition 
being crystallization, i.e. the way a liquid changes into a solid. In physics, systems undergoing a first order phase 
transition can be divided into two groups: (a) Mean field systems, where the size of the boundary of a sphere of a 
(large) finite radius drawn around one particle (variable) is of the same order as the volume of this sphere, (b) Finite 
dimensional systems where the size of the boundary is much smaller than its volume. Typically in d dimensions, a 
sphere of radius r has surface Sdr'^~^ and volume v^r"^ {sd and Vd being the surface and volume of a sphere of radius 
one). 

In mean field systems metastable states have exponentially large (in the size of the system) living time, meaning 
that is would take an exponential time to randomly find a fluctuation that would be able to overcome the barrier 
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between the local maximum and the global one. Whereas in finite dimensional systems the living time of metastable 
states is always constant. A simplified argument leading to this conclusion uses the fact that maximization of the 
entropy is the driving force of system dynamics. Consider the system being in the metastable state (e.g. supercooled 
liquid), if a random fluctuation appears flipping a droplet of radius R into the equilibrium state (crystal) then this 
causes free entropy increase of A^w^i?'* and decrease because of the surface terms Fs^-R'^^^ for R large enough 
R > R* = rsd{d — 1) / (A^Vdd) the gain is more important than the loss and such a randomly created droplet will 
start to grow. The crucial point is that the critical radius R* does not depend on the system size N and hence such 
a fluctuation arises with a constant probability in the finite dimensional systems. The processus we described here is 
on the basis of nucleation theory in physics that described the growth of crystal droplets close to a first order phase 
transition [50l ISTl. 
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FIG. 12. Examples of seeding measurement matrices F for CS. Here — 8. (i) A band-diagonal matrix, already introduced 
in [l], where L = Lc — Lr, and Jp,, — 0, except for Jp_p = 1, Jp,p-i = Ji, and Jp-i,p = J2. Good performance is typically 
obtained with large Ji and small J2. (ii) Another band-diagonal matrix where L — Lc = L,-, and Jp^q — 0, except for Jp,p — 1, 



J, 



J, and Jp,p- 



1 with ui = 1, . . . , W. Good performance is typically obtained with small J and W > 2. Since all 



variances are lower or equal to one, this matrix can be realized only having elements (0,±1). (iii) A lower triangular matrix, 
that can be viewed as the matrices of type (ii) when W — L — 1. Again, good performance is obtained with relatively small 
J. (iv) In some cases, we observed that the last block of variables was not recovered correctly. Adding a new line in the 
matrix {Lr = Lc + 1), as in this example, cures the problem. All these matrices are motived by the same consideration: 
more measurement are made in the first block of the signal such that the information will first appear in this block, and then 
propagate into the whole vector. 



The whole idea of seeding matrices is to mimic the process of nucleation and crystal growth in the reconstruction 
of compressed sensing signal. This idea, together with the previous work on spatially coupled LDPC codes [38 , also 
motivated the design of the seeding matrix in ^j. There are three key ingredients that need to be present in the 
system in order for the seeding to work. 

(a) The free entropy driving force. To escape from a metastable state we need the existence of a higher maximum 
of the free entropy ^(D). This ingredient is present in the BP reconstruction of the original signal as long as 
a > po (or a > ac for the nosy case). Let us note here that seeding does not improve performance of the £1 
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reconstruction algorithms (see appendix |B]) , because this "driving force" is missing since the Donoho- Tanner 
transition is continuous (it is a second order transition in the physics classification). 

(b) The existence of a nucleus (seed). We need a part of the system to be already in the equilibrium state. This 
ingredient can be achieved by making the measurement matrix inhomogeneous and measuring at a much higher 
subsampling rate a small subpart of the signal - that we call a "seed" . 

(c) An interaction between the seed and the rest of the signal that enables the growth of the seed. In [T] and [T3] 
this was achieved via the so-called spatial coupling. The signal was divided into blocks and the measurements 
designed in such a way that only several neighborhooding blocks are measured at a time. Similar ideas have 
been used recently in the design of sparse coding matrices for error correcting codes [5^ . Here we also 
give an example of a seeded measurement matrix that does not have spatially coupled structure. 

In this article we present several ways how to achieve points (b) and (c) , and hence be able to do reconstruction in 
CS at yet lower subsampling rates. We, however, stress that there is relatively a lot of freedom in the construction 
of these matrices and their optimization and adaptation to physically constraint measurements is surely a promising 
area of future research. 

The matrix we used are presented n Fig.[T2j These are block-matrices defined as follows: The N variables are divided 
into Lc groups of Np, p — 1, . . . , Lc, variables in each group. We denote Up = Np/N . And the M measurements are 
divided into Lr groups of Mq, q = 1, . . . , L^, measurements in each group, we define aqp = Mq/Np. Then the matrix 
F is composed of Lr x Lc blocks and the matrix elements F^i are generated independently, in such a way that if /i 
is in group q and i in group p then F^i is a random number with zero mean and variance Jq^p/N. Thus we obtain a 
Lr X Lc coupling matrix Jq^p. For the asymptotic analysis we assume that Np — > oo, for all p = 1, . . . , Lc and Mq — > oo 
for all q = 1, . . . , Lr- The total subsampling rate is then a — ^q/(^p=i -^p)- The case of homogeneous matrix 

can easily be recovered by setting Lc — L,. = I. We define /(/i) or to be the index of the block to which /i or i 
belongs, Bq is the set of indices in block q. 

In all the examples of seeding matrices used in this article and presented in Fig. [12] the elements of the signal vector 
are split into Lc equally sized blocks {Np = N/Lc). The first block of measurements has size Mi and the other — 1 
measurement blocks have equal size Mq — {M — Mi)/{Lr — 1) for q > 1. In all the examples here we achieve the 
seeding by taking asccd — MiLc/N larger than q;bp > Poj and ctbuik — MqLc/N for q > 1 that can be approaching 
Po- The overall measurement rate is then 

asccd + {Lr — l)Q!bulk /, ^ 

a = . (158) 

Lc 

Hence a abuik as Ld Lr — > 1, and Lr — >■ oo. The matrix elements Fp^i are chosen as random i.i.d variables with 
variance Jq^p/N if variable i is in the block p and measurement /x in the block q. 



B. Seeding experiments for noiseless measurements 

In Fig. [13] we demonstrate how BP reconstruction works for seeded measurement matrices. We generated signal 
elements of density po — 0.4, the non-zero elements are Gaussian random variables with zero mean and unit variance. 
We obtained a = 0.5 noiseless measurements per signal element using seeded matrices generated as described above. 
We plot the mean-squared error in every block (different lines) as a function of BP iteration time. We compare 
a result from BP with its asymptotic density evolution behavior, obtaining excellent agreement. Note that in this 
case, the BP reconstruction for standard homogeneous matrices would fail. Notice that in both cases illustrated in 
Fig. [13] the first blocks are reconstructed fast and by interaction with the subsequent blocks the reconstructed region 
is propagated to the following blocks. 

Now that we illustrated that the BP reconstruction for large systems indeed agrees with the asymptotic density 



evolution analysis we plot in Fig. 14 two examples of the number of iterations (defined as the time when mean-squared 
error E < 10^^) it takes to reconstruct exactly signal of density po with measurement rate a po- 

In Fig. [15] we show how does the number of iterations needed for exact reconstruction depends on the number 
of blocks L for different signal densities pq. We see that in case of the one-dimensional seeding matrix of type (ii) 
the number of iterations depend linearly on the the number of blocks. The boundary of the reconstructed region 
is propagating as a kind of spatially localized wave at a constant speed, as illustrated in Fig. [I6] On the other 
hand for the long-range triangular matrices of type (iii) the number of iterations grows only as logarithm of the 
number of blocks, logL, (at least for large L). The propagation of the reconstructed region does not really correspond 



to a localized traveling wave, as visible from Fig. 13 (b). In both cases the speed of the growth of the seed (i.e. 
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a) 




b) 




FIG. 13. Reconstruction on the signal with seeded measurement matrices. The mean-squared error in every block is plotted 
as a function of the iteration time. We compare the numerical analysis of BP for a signal of N = 40000 elements with the 
analytic result obtained in the N ^ oo limit using density evolution. The agreement is very good. The density of the signal is 
po ~ 0.4, the non-zero elements are Gaussian with zero mean and unit variance. The measurement rate is a = 0.5. The two 
(a) The seeding matrix of the type (ii) from Fig. 



cases are: 
L = 15, J = 0.01 and W = 2 
Qfsccd = 0.68, Obuik = 0.48, L = 



12 



with i.i.d. 0,±1 random elements, 
(b) The seeding matrix of the type (iii) from Fig. 
10 and J = 0.1. 



12 



with 



asocd = 0.7, Qbuik = 0.485, 
Gaussian random elements. 
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FIG. 14. Reaching the Of — >■ po limit. Number of iterations needed to find the original signal of density po — 0.499 for (a) a 
Gauss-Bernoulli signal and (b) a 0, ±1 signal. In both cases, we used BP with a Gauss-Bernoulli signal model with p = pQ. 
The blue line shows the BP convergence time for homogeneous matrices, that diverges at the spinodal line asp- The red line 
shows the BP reconstruction done with type (iii) seeding matrices: (a) using Oseed = 0-8, abuik ~ 0.5, and J = 2.10"^ (b) 
using ascod ~ 1, Qbuik = 0.5, and J — 0.01 (in this case we added one block of measurements, Lr = L + 1). As L increases in 
both cases, the total measurement rate a decreases and approaches abuik = 0.5 m po — 0.499. The number of iterations needed 
for exact reconstruction then diverges with L. The difference between the reconstruction limits of BP and of £i is striking. 



reconstructed region) is proportional to the interaction strength between the first non-reconstructed block and the 
seed. In the case of one-dimensionally coupled matrices this strength does not depend on the position of the seed 
boundary. In the case of triangular seeding matrix the strength is proportional to the size of the already reconstructed 



region, hence SL/St ^ L, which gives the logarithmic dependence seen in Fig. 15 



In Fig. [16] right, and Fig. [M] right we show that the BP reconstruction with seeding matrices works also in the 
case when the signal model does not at all correspond to the actual signal distribution. In the two figures the signal 
components are 0, ±1, whereas the signal model was still Gauss-Bernoulli. Since the probabilistic approach is optimal 
for noiseless measurements even when the signal distribution is not known (as proven in Sec. II A) the seeding strategy 
is able to approach the information theoretic limit a — > po also in this case. 

We have not done expectation maximization learning in the data presented in this section, but this strategy is also 
useful with the seeding matrices and is included in our implementations. Its behavior is analogous to the one in the 
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FIG. 15. Number of iterations needed for reconstruction with type (ii) seeding matrices (on the left) and type (iii) seeding 
matrices (on the right). With type (ii) matrices, a wave is propagating in the system with a constant speed, while for type 
(iii) matrices with long range interactions the speed is proportional to L, hence the total time scales as logL. Left: we used 
J = 0.02, W — 2, asccd ~ 1.0, Qfbuik = 0.5. Right: we used J = 0.01, ascod = 1.0, Qbuik = 0.5. We used p = po to make these 
data. 





FIG. 16. Left: Evolution of the mean-squared error in each block as a function of iteration time. Here we used type (ii) 
seeding matrices with W = 2, L = 50, ftseod = 1-0, Qbuik = 0.5, and J = 0.01. With that type of matrix, the boundary 
of the reconstructed region is propagating as a localized wave. Right: Same as Fig. [13] for a "adversary-case" signal having 
components 0,±1, with A'^ = 10000. We used a = 0.6 and po = 0.4, with the seeding matrix of type (iv) with L = 10, 
Cfsccd = 1.0, Qfbuik = 0.5, J = 0.1. Exact reconstruction is achieved even thought the signal model (Gauss-Bernoulli) does not 
correspond to the empirical signal distribution. 



case of homogeneous matrices, as discussed in Sec. |VB| 



C. Seeding experiments for noisy measurements 



Every CS method requires robustness with respect to the measurement noise. In Sec. |VC"] we analyzed the phase 
diagram under measurement noise. In particular we showed existence of two phase transitions ac(/3o) and ad{po) (see 
e.g. Fig. [s]) such that for a ^ (ac,Q;d) and for the signal model matching the empirical signal distribution the belief 
propagation inference is as good as the optimal Bayesian inference. In other words the final MSE achieved by BP for 
a < or a > is the best achievable for a given measurement matrix F. If a stronger noise robustness is required 
then one would have to use a different measurement protocol or much larger sampling rate a. The only region that 
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is open to improvement is for measurement rates ac < a < ad- With seeding we can indeed improve considerably 
the final MSE in this region. The noise stability of the seeding strategy was touched already in [T], see also [Mj for a 
rigorous discussion. 

The performance of the seeding strategy in the presence of noise can be again studied using the replica/density 
evolution equation. In Fig. 17 we illustrate the evolution of the MSE for CS with noisy measurements for subsampling 
rates ac < a < ad for which BP with the homogeneous matrices gives a MSE much larger than the noise variance A. 
Again, in order to have a working seeding mechanism, the free entropy associated with the fixed point of BP close to 
the solution must dominate the free entropy associated with the meta-stable state. This is the case for measurement 
rates ac < ctbuik < ctd and can thus be exploited. 

Note, however, that in the presence of noise the free energy difference between the global and local maxima is finite 
(whereas it was diverging for the noiseless case), this means that the seeding matrices need to be constructed with 
more care in order to saturate the threshold. In particular the interaction width W (see Fig. 12) has to grow when 
the threshold ac is approached. 




a) 



2000 3000 

Iterations 




b) 



40 60 

Iterations 



FIG. 17. Seeding matrices with noise: Evolution of the MSE in each block, as in Fig. |16[ but in the noisy case. Here we used 
type (iv) seeding matrices with W = 2, L = 100, ascad ~ 1.0, Obuik = 0.5, and J — 0.001. Left: Even with a large noise with 
standard deviation \/A = 10~^, the front wave is still propagating with a finite speed, leading to a reconstruction with a final 
MSE of the order of A. This demonstrates the robustness of our approach with additive noise. Right: When the noise (here 
\/A — lO"'^) is too high (so that Obuik < Oc, see text) there is no such propagation. Here, only the very first blocks (the first 
one having a^ccd = 1 and its close neighbors) go to low MSE while the rest of the system stays far from the solution. Essentially 
all the changes are done in the 100 first iterations shown here. In this case, the seeding matrices does not bring improvement 
(and no other method could in this case). 



VII. CONCLUSION 

This paper presents a detailed analysis of the new strategy for compressed sensing that we introduced in [T] . With 
respect to this earlier work we have provided here a more detailed study of the phase diagrams and of the associated 
phase transition for BP reconstruction algorithm and for the (intractable) optimal reconstruction. We have treated 
in detail the case of noisy measurements and we have shown that our approach presents excellent stability with 
respect to noise, in the sense that the BP algorithm (with seeding if needed) is able to reconstruct the signal with 
mean-squared error as low as the optimal inference algorithm based on exhaustive enumeration (which is of course 
not computationally tractable). We have discussed reconstruction in the case of mismatching signal model and 
signal distribution and we have shown that in the noiseless case this mismatch does not pose a serious problem. We 
have introduced and studied new types of seeding measurement matrices with which we were also able to achieve 
reconstruction at almost optimal reconstruction rates. 
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Appendix A: Derivation of the replica analysis for block matrices 



Here, we rederive the replica analysis for the seeding matrices described in the main text Sec. VI This follows 
closely the derivation presented in Sec. |IVB[ we only need add the block indices. To evaluate the average of the 
replicated partition function (91 ) for the block matrices F we introduce the order parameters per block 



a= 1,2,. 



a = 1,2 



(Al) 
(A2) 
(A3) 



where Bp represents the index of the variables in block p = 1, . . . , Lc- 

We introduce a Dirac delta function that fixed the order parameters and we make use of the following integral 
representation for delta function: 



1= jWdQldQl Jl dgfdqf J|dm^dm;;e^^^=i^"=i'^^^^^^^"^'^«p^"^^'^ 



a,p 



(A4) 



Inserting Eq. ( |A4[ ) into the expression of Ef,s.c (^"), Eq- (911, we get 

Ef,s,c(^")= / ndQpdQ^dTO^dm;;e^p=i^45'5?'2?-"p'^'"^] [| dgf dgf e"' ^^=i 



I. a 

La 



72^ 



Re 



When averaging Z", we again need to evaluate the quantity X^, defined in Eq. (95). We define m° = Si^i ^t^i^f^ 
with a — {0,1, ... ,n} where corresponds to the index of the signal, x'^ — Si. The quantities then obey joint Gaussian 
distribution with Ef,^(u°) ~ and 



La La 

p—i p—i 

La La 

IEf,« («) - Y JiMp'^pQp ' IEf,4 (u^y^) = ^ Jii^.)pnpQ; . 

p—i p—i 

Under the replica symmetric ansatz the replicas are considered equivalent, i.e. 

Trip — Trip, Qp — Qp, Qp Qp . 

We introduce pq, rhq, qq, Qq as follows 



Pq = Pqs"^ Y^ Jqp^p, ™g = X! Jqp^p^^P' <iq 



La La 

^ ^ Jqp^pQpt Qq ^ ^ JqpTlpQp . 
p—1 P— 1 



And thus u° = uj'j — wj^ + ^p, a = 1, 2, . . . , n are also joint Gaussian distributed with zero means and 

Gaa = Ef,^{v1^v'^) = (57(^) + - 2toj(^) + Aq, a = 1, 2, . . . , n , 
Gab = Ef,5(w^uJ^) = qi(p) + pi[^) - 2to/(^) + Ao, a<b, 



(A5) 
(A6) 

(A7) 
(A8) 



(A9) 
(AlO) 
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where G is the inverse covariance matrix. For the block matrices we have 



1 

det(l + ^) = e 



(All) 



From here foUowing the same steps as for derivation of Eq. ( 111 ) we obtain 



q=l 



qq - 2rhq + + Ap 



+ log (A + - qq) 



QpQp ~ QpQp 



+ 5^np / ds[il-po)6{s)+poMs)] / 2?^log / da;e-^V^" [(1 - + 



(A12) 



where = Mq/N. From here we obtain the expression of free entropy in Eq. (136). 



Appendix B: Phase diagram of the l\ reconstruction for seeding matrices 



In this section, we apply the well-known li norm reconstruction for the seeding matrix (i) in Fig. 12 i.e., for the 
coupling matrix, Jp^p — 1, Jp^p_i = Ji,Jpi,p = J2 and others are zeros, and see if the delicate designed matrix can 
also provide substantial improvement for the reconstruction limit. 

In order to study the ii norm, we use the large /3 limit of the problem defined by the partition function 



N M / \ 

i=l /j=l V i / 



(Bl) 



We can use all our previous replica computation with the substitution in the local measure of (1 — p)(5(xi) + p(f>o{xi) 
by e~'^l^'l. In the case of our seeding matrix F, this gives Z — Je"^* with 

H{Qp}p=i, Up}p=i, {mp}p=i: {Qp}p=i, {qp}p=i, {™p}p=i) = 



1 



= 1 



(jp - 2rhp + pp 
Qp Qp 



+ log {Qp - qp) 



E 



p=i 



QpQp , QpQp 

— - — - — m„m„ H — ^— ^ 
2 P P 2 



+ E / d,s [(1 - p)Sis) + Ms)] log I / dx 



Qp+qp J2 



+a;(mpS+ZY'?p)g— (9|a;| 



where we always use (137): 



Pp^ Po{s^)^Jpq, T^P = ^J] 
q=l 9=1 



L L L 

pq'^qi Qp — ^ ' JpqQq: Qp — ^ ^ JpqQq 



q=l q=l 

In the large j3 limit, we assume the scaling for the order parameters as follows: 

Qp + Qp^ PRp , % = /3^rp , rhp = (ijlp 
Qp = 0(1) , qp = 0(1) , nip = 0(1) , Qp - qp = 0(l//3) 

and we write specifically Tp — P{Qp — qp). Therefore, the free entropy $ scales linearly in 



$ _ 1 



qp - 2ihp + Pp 



E 



QpRp 



nipfXp 



^ y y ds [(1 - pii)5{s) + (?!)o(s)] mm {^x^ - {PpS + ^/fpz)x + 



(B2) 



(B3) 



(B4) 



(B5) 
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It is easy to check that, for L = 1, this gives back the free energy written e.g. by Kabashima et al. |24| . 

In order to study the transition, we assume the following scaling when one is near to the regime of exact retrieval 
of the signal: 



Vpe {!,..., L} : fip^ 



oo 



(B6) 



With the above scaling we find that the saddle point equations of order parameters Vp, Ep — qp — 2mp + po{s ), fip 
and Tp are independent of the distribution of the nonzero elements in signal <j)Q{x)^ as long as (j)Q{x) = 0o(~2;). For 
L = 1, i.e., the canonical matrix F , the saddle point equations are given as : 

1 



E 



1 



a 
r 
a 



Pq + 2(1 - po) 
2(1 -Po) 



Dz 



Dz {z - \f + po{l + X^) 



[q - 2m + pa{s^)] = — [q - 2in + pa{s^ 



(B7) 

(B8) 
(B9) 
(BIO) 



They can be simplified further as a closed system of two variables a, A: 

a = po + 2(l-po) / Dz, 



/OO 
Dz (z-A)2 + po(1 + A2). 



(Bll) 



They give the critical value of a for a given value of po, these are the equations of [71 [^. 

For the seeding matrix, L >2, due to the fact that the final result does not depend on 0o (for symmetric ones), we 
thus take 0o as a centered Gaussian distribution of variance one. After some work we get: 



(B12) 



En — 





where 



H(a) 



Dz 



and 



'ip{a,b) 



Dz {z-hf = {l + b^)H{a) + 



\f2i 



/2 



(B13) 
(B14) 

(B15) 

(B16) 
(B17) 



The following table gives the reconstruction threshold po obtained for the same values of the a with probabilistic 
reconstruction (left) and l\ (right). The results are obtained by optimizing over Ji, Ji in the window [0.03, 1]. We 
notice that the results of l\ with optimal Ji, J2 are slightly worse than the results of £1 with just one block L = 1. 

Altogether, this demonstrates that the gain in performance using seeding matrices is really specific to the Bayes 
inference approach and hence it is the combination of the probabilistic approach, the message passing reconstruction 
with parameter learning, and the seeding design of the measurement matrix that is able to reach the best possible 
performance. 
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Po 


a 




Obulk 


Ji 


J2 


L 






Q 


'^seed 


abulk 


Ji 


J2 


L 


Po{L = 1) 


0.1 


0.130 


0.3 


0.121 


40 


1.2 


20 




0.014 


0.130 


0.3 


0.121 


0.03 


0.31 


20 


0.016 


0.2 


0.227 


0.4 


0.218 


10 


0.8 


20 




0.056 


0.227 


0.4 


0.218 


0.03 


0.31 


20 


0.059 


0.3 


0.328 


0.6 


0.314 


8 


0.4 


20 




0.096 


0.328 


0.6 


0.314 


.097 


0.57 


20 


0.100 


0.4 


0.426 


0.7 


0.412 


4 


0.4 


20 




0.145 


0.426 


0.7 


0.412 


.03 


0.57 


20 


0.150 


0.6 


0.624 


0.9 


0.609 


2 


0.2 


20 




0.278 


0.624 


0.9 


0.609 


.175 


0.57 


20 


0.283 


0.8 


0.816 


0.95 


0.809 


2 


0.2 


20 




0.476 


0.816 


0.95 


0.809 


.175 


0.31 


20 


0.481 



TABLE I. Parameters used for the probabilistic BP reconstruction of the Gaussian signal (left) and with the seeded £i. 

Appendix C: Equations for a mixture of Gaussians 



We consider here the case when the signal model is a mixture of G Gaussians 

G 



(CI) 



a=l 



where Wa are non-negative weights Y^'^=i Wa — The functions fa and fc ( 32p3 1 needed by the BP algorithm are 
then 



(1 - p)e 2^2 + pY.'^^^Wa 



= ^^^^^^ 



[a^S2(S2 + aD + (^aS2 + i?a^)2] 



n2^^2 \ 



For a signal that itself is a mixture of Gaussians 



- f 

J a ' 



(C2) 



(C3) 



io(x) = 5^^°AA(xS,(a°)2)^ 



(C4) 



the density evolution equations ( 114||116 ) simplify into single-Gaussian-integral equations 



G I G I 

A — / / vfi \ ra^ ^ — ^ / m V 

a— 1 a— 1 

J mm ^ J m V 



a=l 
Go 



y = (l-po) [vzU-,z^)+poJ2w^, fvzf,{l,zJiaO) 



m^ 



(C5) 
(C6) 



where we used integration per-parts to obtain the simplification in the first equation. We took advantage of the fact 
that a double Gaussian integral of a function that depends only on a sum of the Gaussian variables can be written as 
a single Gaussian integral with variance being the sum of variances and mean being the sum of means. We remind 
1/m = (A + V)/a, and q/m^ = (Aq + E)/a. 

Under the optimal Bayesian inference when 4>a{x) — (j){x), po — p, A — Aq the system of two equations reduces 
into a single one, since E — V and q = m. 
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